Solr is really designed for this kind of processing and configurability; the Solr connector is concerned only with getting the documents to Solr. So I think your best bet is either to use Solr's existing pipeline infrastructure, or to write your own update handler that does what you need. (Obviously the former is preferred over the latter...)
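For reference, Solr's built-in pipeline hook is the update request processor chain, configured in solrconfig.xml. A minimal sketch of the idea (the chain name, field name, and handler wiring here are illustrative, not something from this thread, and the exact factory classes available depend on your Solr version):

```xml
<!-- solrconfig.xml: a hypothetical processor chain that mutates
     incoming documents before they are indexed -->
<updateRequestProcessorChain name="extract-fields">
  <!-- Example stock processor: strips HTML markup from a field.
       "body_text" is an illustrative field name. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body_text</str>
  </processor>
  <!-- A custom UpdateRequestProcessorFactory could be inserted here
       to split content into additional fields. -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Attach the chain to the update handler the connector posts to -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">extract-fields</str>
  </lst>
</requestHandler>
```

Because the chain runs inside Solr's update handler, the connector needs no changes: it keeps posting raw documents, and the processors run before indexing.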
Karl

On Mon, Dec 5, 2011 at 1:45 PM, Michael Kelleher <[email protected]> wrote:
> I am crawling a bunch of HTML pages within a site, that will be sent to Solr
> for indexing. I want to extract some content out of the pages, each piece
> of content to be stored as its own field BEFORE indexing in Solr.
>
> My guess would be that I should use a Document processing pipeline in Solr
> like UIMA, or something of the like.
>
> However, to limit the amount of load on Solr, I was wondering if there was a
> way to "hook" into the Solr connector to create these additional fields /
> handle this processing. Maybe this would be an "extended" Solr connector
> that I would create.
>
> Or should this really be done within Solr, because Solr already handles this
> kind of processing?
>
> Any guidance / help would be great.
>
> thanks.
