Re: Accessing crawled data

Claudio Martella Wed, 23 Dec 2009 02:57:46 -0800

Andrzej Bialecki wrote:
> On 2009-12-22 16:07, Claudio Martella wrote:
>> Andrzej Bialecki wrote:
>>> On 2009-12-22 13:16, Claudio Martella wrote:
>>>> Yes, I'm aware of that. The problem is that I have some fields of the
>>>> SolrDocument that I want to compute by text analysis (basically I want
>>>> to do some smart keyword extraction), so I have to get in the middle
>>>> between crawling and indexing! My current solution is to dump the
>>>> content to a file through the segreader, parse it, and then use SolrJ
>>>> to send the documents. Probably the best solution is to set my own
>>>> analyzer for the field on the Solr side, and do keyword extraction there.
>>>>
>>>> Thanks for the script, I'll use it!
>>>
>>> Likely the solution that you are looking for is an IndexingFilter -
>>> this receives a copy of the document with all fields collected just
>>> before it's sent to the indexing backend - and you can freely modify
>>> the content of NutchDocument, e.g. do additional analysis,
>>> add/remove/modify fields, etc.
>>>
>> This sounds very interesting. So the idea is to take the NutchDocument
>> as it comes out of the crawling and modify it (inside of an
>> IndexingFilter) before it's sent to indexing (inside of Nutch), right?
>
> Correct - IndexingFilter-s work no matter whether you use Nutch or
> Solr indexing.
>
>> So how does it relate to nutch schema and solr schema? Can you give me
>> some pointers?
>>
>
> Please take a look at how e.g. the index-more filter is implemented -
> basically you need to copy this filter and make whatever modifications
> you need ;)
>
> Keep in mind that any fields that you create in NutchDocument need to
> be properly declared in schema.xml when using Solr indexing.
>
OK, I understand the rationale behind this.
Another question for me is how to set up the pipeline. For instance, I
want to first run the LanguageIdentifier and then move the content to a
particular field (imagine something called "content-$lang", as I want
each content-* field to have its own filtering (stopwords, stemming, etc.)
on the Solr side).
I still don't get where I can decide when the LanguageIdentifier runs
and when my own filter does.
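
To make the idea concrete, here is a rough sketch of the field move I have
in mind. It deliberately does not use the real Nutch classes: a plain Map
stands in for NutchDocument, and the class name LangFieldRouter and the
field names "content" and "lang" are assumptions of mine. It only works if
the language identifier has already written its result into the document
before this step runs, which is exactly the ordering question above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the logic of a custom indexing filter.
// A Map<String, String> plays the role of the document; this is NOT the Nutch API.
public class LangFieldRouter {
    // Assumed field names; the real ones depend on the plugins in use.
    static final String CONTENT = "content";
    static final String LANG = "lang";

    // Returns a copy of the document in which "content" has been moved to
    // "content-<lang>". If no language or no content is present, the copy
    // is returned unchanged.
    static Map<String, String> route(Map<String, String> doc) {
        Map<String, String> out = new HashMap<>(doc);
        String lang = out.get(LANG);
        String content = out.get(CONTENT);
        if (lang == null || lang.isEmpty() || content == null) {
            return out; // language not identified (yet): leave the document alone
        }
        out.remove(CONTENT);
        out.put(CONTENT + "-" + lang, content);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put(CONTENT, "Hallo Welt");
        doc.put(LANG, "de");
        // The content now lives under "content-de" instead of "content".
        System.out.println(route(doc));
    }
}
```

Each resulting content-* field would then need to be declared on the Solr
side with its own analysis chain.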


thanks

Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it

