Solr is really designed for this kind of processing and configurability; the Solr connector is concerned only with getting the documents to Solr. So I think your best bet is either to use Solr's existing pipeline infrastructure, or to write your own update handler that does what you need. (Obviously the former is preferred over the latter...)
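For reference, Solr's built-in pipeline hook is the update request processor chain, configured in solrconfig.xml. A minimal sketch of the idea (the chain name, field name, and handler wiring here are illustrative, not something from this thread, and the exact factory classes available depend on your Solr version):

```xml
<!-- solrconfig.xml: a hypothetical processor chain that mutates
     incoming documents before they are indexed -->
<updateRequestProcessorChain name="extract-fields">
  <!-- Example stock processor: strips HTML markup from a field.
       "body_text" is an illustrative field name. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body_text</str>
  </processor>
  <!-- A custom UpdateRequestProcessorFactory could be inserted here
       to split content into additional fields. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Attach the chain to the update handler the connector posts to -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">extract-fields</str>
  </lst>
</requestHandler>
```

Because the chain runs inside Solr's update handler, the connector needs no changes: it keeps posting raw documents, and the processors run before indexing.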
Karl

On Mon, Dec 5, 2011 at 1:45 PM, Michael Kelleher <[email protected]> wrote:
> I am crawling a bunch of HTML pages within a site, that will be sent to Solr
> for indexing. I want to extract some content out of the pages, each piece
> of content to be stored as its own field BEFORE indexing in Solr.
>
> My guess would be that I should use a Document processing pipeline in Solr
> like UIMA, or something of the like.
>
> However, to limit the amount of load on Solr, I was wondering if there was a
> way to "hook" into the Solr connector to create these additional fields /
> handle this processing. Maybe this would be an "extended" Solr connector
> that I would create.
>
> Or should this really be done within Solr, because Solr already handles this
> kind of processing?
>
> Any guidance / help would be great.
>
> thanks.
