Document Processing

Michael Kelleher Mon, 05 Dec 2011 10:46:04 -0800

I am crawling a bunch of HTML pages within a site that will be sent to Solr for indexing. I want to extract some content out of the pages, with each piece of content stored as its own field BEFORE indexing in Solr.

My guess is that I should use a document processing pipeline in Solr, such as UIMA or something similar.

However, to limit the amount of load on Solr, I was wondering if there was a way to "hook" into the Solr connector to create these additional fields / handle this processing. Maybe this would be an "extended" Solr connector that I would create.
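
To make the idea concrete, here is a rough sketch of the kind of pre-indexing extraction I have in mind, done on the crawler side so Solr only receives ready-made fields. It is only an illustration under assumptions: the field names (title, headings, description), the FieldExtractor class and the build_solr_document helper are hypothetical, and the choice of which tags to pull out is arbitrary. It uses only the Python standard library and does not talk to Solr at all.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects a few pieces of page content into separate fields.

    The field names used here are hypothetical; they would have to
    match whatever fields the Solr schema actually defines.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "headings": [], "description": ""}
        self._current = None   # tag whose text is being captured, if any
        self._buffer = []      # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            self._buffer = []
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.fields["description"] = attr.get("content") or ""

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace in the captured text.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.fields["title"] = text
            else:
                self.fields["headings"].append(text)
            self._current = None


def build_solr_document(url, html):
    """Return a dict with one entry per extracted field, keyed by URL."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    doc = {"id": url}
    doc.update(parser.fields)
    return doc
```

A dict like this could then be serialized and handed to whatever component sends documents to Solr, which is the part I would want the "extended" connector to do.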

Or should this really be done within Solr, because Solr already handles this kind of processing?


Any guidance / help would be great.

Thanks.
