I am crawling a set of HTML pages within a site, which will be sent to Solr for indexing. I want to extract some content from each page and store each piece of content in its own field BEFORE it is indexed in Solr.

My guess is that I should use a document processing pipeline in Solr, such as UIMA or something similar.

However, to limit the load on Solr, I was wondering whether there is a way to "hook" into the Solr connector to create these additional fields and handle this processing. Perhaps I would create an "extended" Solr connector for this.
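
To make the idea concrete, here is a rough sketch (in Python, standard library only) of the kind of processing I have in mind: pull the text of a few tags out of a page and put each one in its own field before the document goes to Solr. It is not tied to any connector or Solr API. The tag-to-field mapping and the field names (page_title, heading) are made up for illustration.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of selected tags into named fields."""

    # Hypothetical mapping of HTML tag -> field name (assumption, not a Solr schema).
    TAG_TO_FIELD = {"title": "page_title", "h1": "heading"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        # Start capturing text when a mapped tag opens.
        if tag in self.TAG_TO_FIELD:
            self._current = self.TAG_TO_FIELD[tag]
            self.fields.setdefault(self._current, "")

    def handle_endtag(self, tag):
        # Stop capturing when the tag that started the capture closes.
        if self.TAG_TO_FIELD.get(tag) == self._current:
            self._current = None

    def handle_data(self, data):
        # Append text only while inside a mapped tag.
        if self._current is not None:
            self.fields[self._current] += data


def extract_fields(html):
    """Return a dict of field name -> extracted text for one page."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}
```

The resulting dict would then be merged into the document that gets sent to Solr, wherever that step ends up living (connector side or Solr side).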

Or should this really be done within Solr, because Solr already handles this kind of processing?

Any guidance / help would be great.

Thanks.
