Re: DC metadata

Koch Martina Tue, 22 Sep 2009 23:42:30 -0700

Hi,

I don't know the howto you're referring to, but I think it belongs to an older 
version of Nutch.


Let me try to explain...

doc.add("key", "value") adds a new field named "key" with the value "value" to 
the document "doc". From that call alone, the indexer only knows that there is 
another field to add; it doesn't know whether the field should be stored, 
tokenized, given term vectors, and so on.
To tell the indexer how to index this field, you have to add a line to the 
addIndexBackendOptions(Configuration conf) method. Every indexing filter has 
to implement this method.

Example:
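public void addIndexBackendOptions(Configuration conf) {
    // "key": stored in the index, but not indexed (not searchable)
    LuceneWriter.addFieldOptions("key",
        LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);
    // "key2": not stored, indexed as tokenized text, term vectors with positions
    LuceneWriter.addFieldOptions("key2",
        LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED,
        LuceneWriter.VECTOR.POS, conf);
}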

You need a parsing filter to extract the data from the pages you're crawling. 
I'm not aware of an existing DC metadata parser, so you need to write a parsing 
filter first to extract the relevant data. Then you can index that data with 
the indexing filter you wrote.
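
To give an idea of the extraction step, here is a minimal sketch in plain Java 
that pulls Dublin Core <meta> tags out of an HTML string. It is only an 
illustration: the class name DcMetaExtractor and the method extract are made up 
for this example, it assumes the DC values are in <meta name="DC.xxx" 
content="..."> tags, and it uses a simple regex instead of a real HTML parser. 
The Nutch parsing-filter plumbing around it is left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects <meta name="DC.*"> values.
final class DcMetaExtractor {

    // Matches a whole <meta ...> tag (assumes no '>' inside attribute values).
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches one quoted attribute: name="value" or name='value'.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    // Returns lower-cased DC names (e.g. "dc.title") mapped to their content.
    // If a name occurs more than once, the last value wins.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group());
            while (attr.find()) {
                String key = attr.group(1).toLowerCase(Locale.ROOT);
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (key.equals("name")) {
                    name = value;
                } else if (key.equals("content")) {
                    content = value;
                }
            }
            // Keep only Dublin Core entries, i.e. names starting with "dc.".
            if (name != null && content != null
                    && name.toLowerCase(Locale.ROOT).startsWith("dc.")) {
                result.put(name.toLowerCase(Locale.ROOT), content);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Annual report\">"
            + "<meta content='Jane Doe' name='DC.creator'>"
            + "<meta name=\"keywords\" content=\"ignored\">"
            + "</head><body></body></html>";
        // Prints the two DC entries; the "keywords" tag is skipped.
        extract(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In a real parsing filter you would store each extracted name/value pair in the 
parse metadata, so that your indexing filter can read it back and pass it to 
doc.add(...).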

Hope this helps.

Kind regards,
Martina




-----Original Message-----
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Tuesday, 22 September 2009 23:08
To: nutch-user@lucene.apache.org
Subject: RE: DC metadata


Any ideas, guys? I'm just stuck here :(

mbel...@msn.com




From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: DC metadata
Date: Fri, 18 Sep 2009 14:12:35 +0000








Hi again,

I just copied the directory of my new plugin (which contains the jar file and 
the plugin.xml) to the nutch/plugins directory, and when I index now it gives 
me this error:

2009-09-18 10:03:44,754 WARN  mapred.LocalJobRunner - job_local_0024
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.<init>(Field.java:279)
        at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
        at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)


Should I write a parser plugin too?

thx



From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: DC metadata
Date: Thu, 17 Sep 2009 18:30:23 +0000








Hi,
I'm trying to add Dublin Core metadata to my index. I wrote the plugin as 
described at http://wiki.apache.org/nutch/CreateNewFilter

and I built the project using ant...
But when I crawled my intranet, I couldn't find the Dublin Core metadata in my 
index.
Did I misunderstand something?

thx
                                          