I am crawling a set of HTML pages within a site, and the pages will be
sent to Solr for indexing. I want to extract some content from the pages,
with each piece of content stored as its own field BEFORE indexing in Solr.
My guess is that I should use a document processing pipeline in
Solr, such as UIMA or something similar.
However, to limit the load on Solr, I was wondering whether there
is a way to "hook" into the Solr connector to create these additional
fields and handle this processing. Perhaps this would be an "extended" Solr
connector that I would create.
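
To make the idea concrete, here is a minimal sketch of the kind of
per-page extraction step described above, run on the crawler side before
anything reaches Solr. It uses only the Python standard library and is not
Solr or connector API. The field names (title, description, headings) and
the names FieldExtractor and extract_fields are hypothetical, chosen only
for illustration.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects a few hypothetical fields from one HTML page."""

    def __init__(self):
        super().__init__()
        # Hypothetical field names; a real schema would define its own.
        self.fields = {"title": "", "description": "", "headings": []}
        self._current = None  # tag whose text is currently being captured
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            self._buffer = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.fields["description"] = attrs.get("content") or ""

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace in the captured text.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.fields["title"] = text
        else:
            self.fields["headings"].append(text)
        self._current = None


def extract_fields(html):
    """Return a dict of field name -> value for one page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Each key in the returned dict would then map to one field on the document
that gets sent to Solr, so Solr itself only has to index the values.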
Or should this really be done within Solr, because Solr already handles
this kind of processing?
Any guidance or help would be great.
Thanks.