Hi all, I am having a problem using the parse-xml plugin with Nutch 0.9 on a 5-node Hadoop cluster to process some XML documents. It causes a huge slowdown at the crawl-reduce stage (to the point that it sometimes causes node timeouts).

My xmlparser-conf.xml separates a large number of tags into different fields, e.g.

------------------------------8<---------------------------
<nutchXmlParser>
  <xmlIndexerProperties type="filePerDocument" namespace="http://purl.org/dc/elements/1.1/">
    <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
    <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
  </xmlIndexerProperties>
  <xmlIndexerProperties type="filePerDocument" namespace="default">
    <field name="tag1" xpath="//tag1" type="Text" boost="1.0"/>
    <field name="tag2" xpath="//tag2" type="Text" boost="1.0"/>
    <field name="tag3" xpath="//tag3" type="Text" boost="1.0"/>
    <!--.... etc. about 100 of these, where these tags represent different types of data -->
  </xmlIndexerProperties>
</nutchXmlParser>
------------------------------8<---------------------------

The aim is to allow searches such as "tag1:data" with the query-more plugin. (Please do correct me if I am using the term "field" wrongly here.)

I can confirm that the problem does not manifest itself when indexing into only a small number of different fields (around 10), which I tested by limiting the field tags in xmlparser-conf.xml.

I wonder if this is because:

1. A large number of different fields is bad in Nutch?
   - Has anybody had experience dealing with a large number of different fields (100+) in the index?
2. parse-xml is inefficient at generating parse data for a large number of fields?
   - Would anybody who has experience with the parse-xml plugin have any comment?

Many thanks in advance for the help. Please do let me know if you require more info. I am relatively new to Nutch, but I am very excited about its potential.

Cheers,
boris
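
P.S. In case it helps anyone look at (2) outside of Nutch, here is a rough standalone sketch that evaluates one "//tagN" expression per field against a single document and times it for 10 and 100 fields. It is only an approximation under stated assumptions: it uses the standard javax.xml.xpath API from the JDK, which may not be what parse-xml uses internally, and the class name, helper methods and sample document are all made up for illustration. It says nothing about the reduce stage itself, only about the raw per-document XPath cost.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test harness, not part of Nutch or parse-xml.
public class XPathFieldTiming {

    // Build a made-up document <doc><tag1>data1</tag1>...<tagN>dataN</tagN></doc>,
    // mirroring the tag1, tag2, tag3... fields in the config above.
    static String sampleXml(int tagCount) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (int i = 1; i <= tagCount; i++) {
            sb.append("<tag").append(i).append(">")
              .append("data").append(i)
              .append("</tag").append(i).append(">");
        }
        return sb.append("</doc>").toString();
    }

    // Parse the document once, then evaluate one "//tagN" XPath per field,
    // which is the access pattern the config implies (an assumption).
    static Map<String, String> extractFields(String xml, int fieldCount) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (int i = 1; i <= fieldCount; i++) {
            String name = "tag" + i;
            fields.put(name, xpath.evaluate("//" + name, doc));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Compare the small (about 10) and large (about 100) field counts.
        for (int n : new int[] {10, 100}) {
            String xml = sampleXml(n);
            long start = System.nanoTime();
            Map<String, String> fields = extractFields(xml, n);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(n + " fields: " + fields.size()
                    + " values extracted in " + micros + " us");
        }
    }
}
```

If the time per document grows much faster than the number of fields, that would point towards (2); if it stays roughly linear and small, the field count in the index, i.e. (1), looks like the more likely suspect.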
