Re: [VOTE 2] Board resolution for Nutch as TLP
On Mon, Apr 12, 2010 at 14:08, Andrzej Bialecki a...@getopt.org wrote: Hi, Take two, after s/crawling/search/ ... Following the discussion, below is the text of the proposed Board Resolution to vote upon. [] +1. Request the Board make Nutch a TLP [] +0. I don't feel strongly about it, but I'm okay with this. [] -1. No, don't request the Board make Nutch a TLP, and here are my reasons... This is a majority count vote (i.e. no vetoes). The vote is open for 72 hours. Here's my +1. And here is my +1. === X. Establish the Apache Nutch Project WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web search platform for distribution at no charge to the public. NOW, THEREFORE, BE IT RESOLVED, that a Project Management Committee (PMC), to be known as the Apache Nutch Project, be and hereby is established pursuant to Bylaws of the Foundation; and be it further RESOLVED, that the Apache Nutch Project be and hereby is responsible for the creation and maintenance of software related to a large-scale web search platform; and be it further RESOLVED, that the office of Vice President, Apache Nutch be and hereby is created, the person holding such office to serve at the direction of the Board of Directors as the chair of the Apache Nutch Project, and to have primary responsibility for management of the projects within the scope of responsibility of the Apache Nutch Project; and be it further RESOLVED, that the persons listed immediately below be and hereby are appointed to serve as the initial members of the Apache Nutch Project: • Andrzej Bialecki a...@... • Otis Gospodnetic o...@... • Dogacan Guney doga...@... • Dennis Kubes ku...@... • Chris Mattmann mattm...@... • Julien Nioche jnio...@... • Sami Siren si...@... RESOLVED, that the Apache Nutch Project be and hereby is tasked with the migration and rationalization of the Apache Lucene Nutch sub-project; and be it further RESOLVED, that all responsibilities pertaining to the Apache Lucene Nutch sub-project encumbered upon the Apache Lucene Project are hereafter discharged. NOW, THEREFORE, BE IT FURTHER RESOLVED, that Andrzej Bialecki be appointed to the office of Vice President, Apache Nutch, to serve in accordance with and subject to the direction of the Board of Directors and the Bylaws of the Foundation until death, resignation, retirement, removal or disqualification, or until a successor is appointed. === -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: [DISCUSS] Board resolution for Nutch as TLP
Hi, On Sat, Apr 10, 2010 at 16:32, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Fri, Apr 9, 2010 at 6:52 PM, Andrzej Bialecki a...@getopt.org wrote: WHEREAS, the Board of Directors deems it to be in the best interests of the Foundation and consistent with the Foundation's purpose to establish a Project Management Committee charged with the creation and maintenance of open-source software related to a large-scale web crawling platform for distribution at no charge to the public. Would it make sense to simplify the scope to ... open-source software related to large-scale web crawling for distribution at no charge to the public? Actually, shouldn't that be something like web search platform, or maybe a crawling and search platform? Nutch is not just a crawler. Anyway, +1 from me. BR, Jukka Zitting -- Doğacan Güney
Re: Nutch 2.0 roadmap
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-07 18:54, Doğacan Güney wrote: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk. Hmm, I am a bit out of touch with the latest changes but I know that the differences between trunk and nutchbase are unfortunately rather large right now. If merging nutchbase back into trunk would be easier then sure, let's do that. * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus... I am obviously a bit biased here but I have no strong feelings really. DataNucleus is an excellent project. What I like about avro-based approach is the essentially free MapReduce support we get and the fact that supporting another language is easy. So, we can expose partial hbase data through a server and a python-client can easily read/write to it, thanks to avro. That being said, I am all for DataNucleus or something else. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Nutch 2.0 roadmap
Hi, On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote: Just a question ? Will the new HBase implementation allow more sophisticated crawling strategies than the current score based. Give you a few example of what I'd like to do : Define different crawling frequency for different set of URLs, say weekly for some url, monthly or more for others. Select URLs to re-crawl based on attributes previously extracted.Just one example: recrawl urls that contained a certain keyword (or set of) Select URLs that have not yet been crawled, at the frontier of the crawl therefore At some point, it would be nice to change generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data) but actual generation would be a pig process. This way, anyone can easily change generate any way they want (even make it more jobs than 2 if they want more complex schemes). 2010/4/7, Doğacan Güney doga...@gmail.com: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. 
Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney -- -MilleBii- -- Doğacan Güney
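To make the generation idea above a bit more concrete, here is a rough sketch of what a pluggable selection predicate over exposed row data could look like. The GenerateFilter interface and the WebTableRow accessors are purely illustrative (borrowed from the nutchbase storage sketch elsewhere in this archive), not an existing Nutch API:

    public interface GenerateFilter {
      /** Return true if this row should go into the next fetch list. */
      boolean shouldGenerate(WebTableRow row, long curTime);
    }

    // Example: recrawl pages whose extracted text mentions a keyword on a weekly
    // schedule, and always pick up frontier URLs that have never been fetched.
    class TopicRecrawlFilter implements GenerateFilter {
      private static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

      public boolean shouldGenerate(WebTableRow row, long curTime) {
        if (row.getFetchTime() == 0) {
          return true; // never fetched yet, i.e. at the frontier of the crawl
        }
        String text = row.getText() == null ? "" : row.getText().toString();
        return text.contains("topic-a") && curTime - row.getFetchTime() > WEEK_MS;
      }
    }

Per-URL-set frequencies would then just be different filters (or one filter reading a per-host schedule) instead of everything being encoded into a single score.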
Re: Nutch 2.0 roadmap
On Thu, Apr 8, 2010 at 21:11, MilleBii mille...@gmail.com wrote: Not sure what u mean by pig script, but I'd like to be able to make a multi-criteria selection of Url for fetching... I mean a query language like http://hadoop.apache.org/pig/ if we expose data correctly, then you should be able to generate on any criteria that you want. The scoring method forces into a kind of mono dimensional approach which is not really easy to deal with. The regex filters are good but it assumes you want select URLs on data which is in the URL... Pretty limited in fact I basically would like to do 'content' based crawling. Say for example: that I'm interested in topic A. I'd'like to label URLs that match Topic A (user supplied logic). Later on I would want to crawl topic A urls at a certain frequency and non labeled urls for exploring in a different way. This looks like hard to do right now 2010/4/8, Doğacan Güney doga...@gmail.com: Hi, On Wed, Apr 7, 2010 at 21:19, MilleBii mille...@gmail.com wrote: Just a question ? Will the new HBase implementation allow more sophisticated crawling strategies than the current score based. Give you a few example of what I'd like to do : Define different crawling frequency for different set of URLs, say weekly for some url, monthly or more for others. Select URLs to re-crawl based on attributes previously extracted.Just one example: recrawl urls that contained a certain keyword (or set of) Select URLs that have not yet been crawled, at the frontier of the crawl therefore At some point, it would be nice to change generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data) but actual generation would be a pig process. This way, anyone can easily change generate any way they want (even make it more jobs than 2 if they want more complex schemes). 2010/4/7, Doğacan Güney doga...@gmail.com: Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. 
Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical
Re: Nutch 2.0 roadmap
Hey everyone, On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-06 15:43, Julien Nioche wrote: Hi guys, I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch? I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ... I know... But I still intend to finish it, I just need to schedule some time for it. My vote would be to go with nutchbase. Talking about features, what else would we add apart from : * support for HBase : via ORM or not (see NUTCH-808https://issues.apache.org/jira/browse/NUTCH-808 ) This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle. Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition. * plugin cleanup : Tika only for parsing - get rid of everything else? Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format. Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats. * remove index / search and delegate to SOLR +1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there. Agreed. I would like to add support for katta and other indexing backends at some point but NutchDocument should be our canonical representation. The rest should be up to indexing backends. Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations, We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF. * new functionalities e.g. sitemap support, canonical tag etc... Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc. I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update? Definitely. :) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
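As an illustration of the "NutchDocument is the canonical representation, the rest is up to indexing backends" idea, a minimal SolrJ-based backend could look roughly like the sketch below. The canonical document is shown as a plain Map standing in for NutchDocument, and the class name is made up; only the SolrJ calls (CommonsHttpSolrServer, SolrInputDocument) are real API:

    import java.util.Map;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Illustration only: a thin backend that pushes field name/value pairs to Solr.
    public class SolrIndexingBackend {
      private final SolrServer server;

      public SolrIndexingBackend(String solrUrl) throws Exception {
        this.server = new CommonsHttpSolrServer(solrUrl);
      }

      public void index(Map<String, String> fields) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        for (Map.Entry<String, String> e : fields.entrySet()) {
          doc.addField(e.getKey(), e.getValue());
        }
        server.add(doc); // buffered on the Solr side until commit()
      }

      public void commit() throws Exception {
        server.commit();
      }
    }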
Re: [ANNOUNCE] New Nutch Committer: Julien Nioche
On Fri, Dec 25, 2009 at 21:48, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi, Thank you for the warm welcome, I feel very honoured to have been made a Nutch committer. Congratulations and welcome :) ! A few lines about myself: I started using Lucene back in 2001, made a few small contributions to it and started LIMO - an open source web application used for monitoring Lucene indices. Over the last 3 years I have used quite a few Apache projects such as SOLR, UIMA and of course Nutch, which I recently used for a large scale crawling project involving a 400 node cluster and 15 billion URLs fetched. My activities at DigitalPebble also cover Natural Language Processing (which is my initial background) and text analysis, and I recently started an open source project named Behemoth which allows text analysis applications to scale using Hadoop. There are quite a few exciting things planned for Nutch in the short term and I really look forward to contributing to it in the new year. Happy Christmas and best wishes for 2010! Julien -- DigitalPebble Ltd http://www.digitalpebble.com 2009/12/24 Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov All, A little while ago I nominated Julien Nioche to be a Nutch committer based on his contributions to the Nutch project (10+ patches in this release alone, and all the mailing list help and thoughtful design discussion). I'm happy to announce that the Lucene PMC has voted to make Julien a Nutch committer! Julien, welcome to the team. The typical first committer task is to modify the Nutch Forrest credits page and add yourself to the website. If you'd like to say something about yourself and your background, feel free to do so as well. Welcome! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Doğacan Güney
Re: State of nutchbase
Hey everyone, So I restarted the nutchbase effort by adding an abstraction on top of the hbase api. The idea is to use an intermediate nutch api (which then talks to hbase) instead of communicating with hbase directly. This allows us a) to not be completely tied to hbase, making a move to another db easier in the future, and b) perhaps to immediately support multiple databases with easy data migration between them. What I have is very very (VERY) early and extremely alpha, but I am quite happy with the overall idea so I am sharing it for suggestions and reviews. Again, instead of using hbase directly, nutch will use a nice java bean with getters and setters. Nutch will then figure out what to read/write into hbase. I decided to use avro because it has a very clean design. Here is a very basic WebTableRow schema:

    {"namespace": "org.apache.nutch.storage",
     "protocol": "Web",
     "types": [
       {"name": "WebTableRow", "type": "record",
        "fields": [
          {"name": "rowKey",    "type": "string"},
          {"name": "fetchTime", "type": "long"},
          {"name": "title",     "type": "string"},
          {"name": "text",      "type": "string"},
          {"name": "status",    "type": "int"}
        ]
       }
     ]
    }

(Ignore "protocol"; I haven't yet figured out how to compile schemas without protocols.) I have copied and modified avro's SpecificCompiler to generate a java class. It is mostly the same class that avro's SpecificCompiler would produce, except that the variables are all private and are accessed through getters and setters. Here is a portion of the file:

    public class WebTableRow extends NutchTableRow<Utf8> implements SpecificRecord {
      @RowKey    // these annotations are used for reflection
      private Utf8 rowKey;
      @RowField
      private long fetchTime;
      @RowField
      private Utf8 title;
      @RowField
      private Utf8 text;
      @RowField
      private int status;

      public Utf8 getRowKey() { ... }
      public void setRowKey(Utf8 value) { ... }
      public long getFetchTime() { ... }
      public void setFetchTime(long value) { ... }
      ...

Note that NutchTableRow extends SpecificRecordBase, so this is a proper avro record. In the future, once hadoop MR supports avro as a serialization format, NutchTableRow-s can easily be output through maps and reduces, which is a nice bonus. We need to force the usage of setters instead of direct access to variables, because one of the nice things about hbase is that you only update the columns that you changed. However, to know which fields are updated (and thus map them to hbase columns), we must keep track of what changed. Currently, NutchTableRow keeps a BitSet for all fields and all setter functions update this BitSet, so we know exactly what changed. There is also a new interface called NutchSerializer that defines readRow and writeRow methods (it also needs scans, delete rows etc., but that's for later). Currently HbaseSerializer implements NutchSerializer and reads and writes WebTableRow-s. HbaseSerializer currently works via reflection. It should be easy to add code generation to our SpecificCompiler so that we can also output a WebTableRowHbaseSerializer along with WebTableRow instead of using reflection. What I have currently can read and write primitive types + strings into and from hbase. You can check it out from github.com/dogacan/nutchbase (branch master, package o.a.n.storage). Again, I would like to note that the code is very very alpha and is not in good shape, but it should be a good starting point if you are interested. Once hbase support is solid, I intend to add support for other databases (bdb, cassandra and sql come to mind). If I got everything right, then moving data from one database to another is an incredibly trivial task.
So, you can start with, say, bdb and then switch over to hbase once your data gets large. Oh, I forgot... HbaseSerializer reads a hbase-mapping.xml file that describes the mapping between fields and hbase columns:

    <table name="webtable" class="org.apache.nutch.storage.WebTableRow">
      <description>
        <family name="p"/>
        <!-- This can also have params like compression, bloom filters -->
        <family name="f"/>
      </description>
      <fields>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="title"     family="p" qualifier="t"/>
        <field name="text"      family="p" qualifier="c"/>
        <field name="status"    family="f" qualifier="st"/>
      </fields>
    </table>

Sorry for the long and rambling email. Feel free to ask if anything is unclear (and I assume it must be, given my incoherent description :) -- Doğacan Güney
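A condensed sketch of the dirty-field tracking described above (field indices and names here are illustrative, not the actual generated code):

    import java.util.BitSet;
    import org.apache.avro.util.Utf8;

    class WebTableRowSketch {
      private static final int TITLE = 0, FETCH_TIME = 1;

      private final BitSet changed = new BitSet();
      private Utf8 title;
      private long fetchTime;

      public void setTitle(Utf8 value)     { title = value; changed.set(TITLE); }
      public void setFetchTime(long value) { fetchTime = value; changed.set(FETCH_TIME); }

      /** The serializer writes only the columns whose bits are set. */
      public boolean isChanged(int field)  { return changed.get(field); }
      public void clearChanged()           { changed.clear(); }
    }

A serializer can then turn only the set bits into column updates, leaving untouched columns alone.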
About NUTCH-650 (hbase integration)
Hey list, I intended to merge in NUTCH-650 last week but stuff got in the way. However, I am very close to finishing all the work so give me a few more days and NUTCH-650 will be in (along with a guide in wiki). -- Doğacan Güney
Re: Nutch dev. plans
Hey guys, Kirby, thanks for all the insightful information! I couldn't comment much as most of the stuff went right over my head :) (I don't know much about OSGI). Andrzej, would OSGI make creating plugins easier? One of the things that bugs me most about our plugin system is the xml files that need to be created for every plugin. These files have to be written manually, and nutch doesn't report errors very well here, so this process is extremely error-prone. Do you have something in mind for making this part any simpler? On Sun, Jul 26, 2009 at 19:09, Andrzej Bialecki a...@getopt.org wrote: [..snipping thread as it has gone too long.] -- Doğacan Güney
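For context, the descriptors in question look roughly like this (a made-up urlfilter plugin; the ids, names and class are illustrative), and every one of them is written by hand:

    <?xml version="1.0" encoding="UTF-8"?>
    <plugin id="urlfilter-example" name="Example URL Filter" version="1.0.0"
            provider-name="example.org">
      <runtime>
        <library name="urlfilter-example.jar">
          <export name="*"/>
        </library>
      </runtime>
      <requires>
        <import plugin="nutch-extensionpoints"/>
      </requires>
      <extension id="org.example.urlfilter" name="Example URL Filter"
                 point="org.apache.nutch.net.URLFilter">
        <implementation id="ExampleURLFilter" class="org.example.ExampleURLFilter"/>
      </extension>
    </plugin>

A typo in any of the ids or the class name only shows up at runtime, usually as an extension point lookup failure.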
Re: Running the Crawl without using bin/nutch in side a scala program
- NONE 2009-07-27 18:49:19,689 WARN mapred.LocalJobRunner - job_local_0001 java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found. at org.apache.nutch.net.URLNormalizers.init(URLNormalizers.java:122) at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) how to solve this issue any idea please reply to this... I think $nutch/build/plugins is not in your classpath, but I am not sure. Thanks in advance.. Sailaja DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails. -- Doğacan Güney
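If the plugins really aren't being found, one thing worth checking (a guess, not a confirmed diagnosis) is the plugin.folders property, which has to point at the directory that actually contains the built plugins when running from an IDE, e.g. in conf/nutch-site.xml:

    <property>
      <name>plugin.folders</name>
      <value>/path/to/nutch/build/plugins</value>
      <description>Directories where nutch plugins are located.</description>
    </property>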
Re: Server suggestion
Hi Dennis, On Fri, Jul 24, 2009 at 16:46, Dennis Kubes ku...@apache.org wrote: fredericoagent wrote: If I want to set up nutch with, let's say, 400 million urls in the database, is it better to have 4-5 super fast and loaded servers or 12-15 smaller, cheaper servers? More smaller servers. Make sure they are energy efficient though and have a decent amount of Ram. If a server goes down, you aren't affected as much. By superfast I mean the cpu is the latest quad core or latest six core processor with 6 Gigs Ram and a 1 or 1.5 TB HD. By cheap I mean something like a Xeon quad core 2.26 cpu with 3 Gig Ram and a 500 GB SATA HD, or if anyone can suggest a better spec ideal. Our first servers were 1Ghz (yes, really) running hadoop 0.04 way back when. Our first production clusters were core2, 4G ECC, 1 750G hard drive. These days we've been building i7 8-core, 12G ECC, 4T raid-5 machines with up to 8 disks, 2U, for around 2200.00 each. If you are looking for a good server builder check out swt.com. They are supermicro resellers and build solid machines. It suggests here: http://en.wikipedia.org/wiki/Core_i7#Drawbacks that core i7's do not support ECC RAM. Have you run into any issues or is WP wrong here? Suggestions. Don't skimp on the hard drive, do at least 750G or more. Price difference is negligible. Do at least 2G Ram, 4G is better, 8G is better than that. You can get up to 12G on regular motherboards these days. After that it gets much more expensive. Also, use more recent processors, such as core2 or i7. They are more power efficient per processing unit. If you want a really fast machine, do multiple disks in a raid-5 format. Dennis -- Doğacan Güney
Re: Nutch dev. plans
Hey list, On Fri, Jul 17, 2009 at 16:55, Andrzej Bialeckia...@getopt.org wrote: Hi all, I think we should be creating a sandbox area, where we can collaborate on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will be importing his HBase work as 'nutchbase'. Tika work is the least disruptive, so it could occur even on trunk. OSGI plugins work (which I'd like to tackle) means significant refactoring so I'd rather put this on a branch too. Thanks for starting the discussion, Andrzej. Can you detail your OSGI plugin framework design? Maybe I missed the discussion but updating the plugin system has been something that I wanted to do for a long time :) so I am very much interested in your design. Dogacan, you mentioned that you would like to work on Katta integration. Could you shed some light on how this fits with the abstract indexing searching layer that we now have, and how distributed Solr fits into this picture? I haven't yet given much thought to Katta integration. But basically, I am thinking of indexing newly-crawled documents as lucene shards and uploading them to katta for searching. This should be very possible with the new indexing system. But so far, I have neither studied katta too much nor given much thought to integration. So I may be missing obvious stuff. About distributed solr: I very much like to do this and again, I think, this should be possible to do within nutch. However, distributed solr is ultimately uninteresting to me because (AFAIK) it doesn't have the reliability and high-availability that hadoophbase have, i.e. if a machine dies you lose that part of the index. Are there any projects going on that are live indexing systems like solr, yet are backed up by hadoop HDFS like katta? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Nutch dev. plans
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialeckia...@getopt.org wrote: Doğacan Güney wrote: Hey list, On Fri, Jul 17, 2009 at 16:55, Andrzej Bialeckia...@getopt.org wrote: Hi all, I think we should be creating a sandbox area, where we can collaborate on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will be importing his HBase work as 'nutchbase'. Tika work is the least disruptive, so it could occur even on trunk. OSGI plugins work (which I'd like to tackle) means significant refactoring so I'd rather put this on a branch too. Thanks for starting the discussion, Andrzej. Can you detail your OSGI plugin framework design? Maybe I missed the discussion but updating the plugin system has been something that I wanted to do for a long time :) so I am very much interested in your design. There's no specific design yet except I can't stand the existing plugin framework anymore ... ;) I started reading on OSGI and it seems that it supports the functionality that we need, and much more - it certainly looks like a better alternative than maintaining our plugin system beyond 1.x ... Couldn't agree more with the can't stand plugin framework :D Any good links on OSGI stuff? Oh, an additional comment about the scoring API: I don't think the claimed benefits of OPIC outweigh the widespread complications that it caused in the API. Besides, getting the static scoring right is very very tricky, so from the engineer's point of view IMHO it's better to do the computation offline, where you have more control over the process and can easily re-run the computation, rather than rely on an online unstable algorithm that modifies scores in place ... Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will feel very natural in a hbase-backed nutch. Give me a couple more days to polish the scoring API then we can change it if you are not happy with it. Dogacan, you mentioned that you would like to work on Katta integration. Could you shed some light on how this fits with the abstract indexing searching layer that we now have, and how distributed Solr fits into this picture? I haven't yet given much thought to Katta integration. But basically, I am thinking of indexing newly-crawled documents as lucene shards and uploading them to katta for searching. This should be very possible with the new indexing system. But so far, I have neither studied katta too much nor given much thought to integration. So I may be missing obvious stuff. Me too.. About distributed solr: I very much like to do this and again, I think, this should be possible to do within nutch. However, distributed solr is ultimately uninteresting to me because (AFAIK) it doesn't have the reliability and high-availability that hadoophbase have, i.e. if a machine dies you lose that part of the index. Grant Ingersoll is doing some initial work on integrating distributed Solr and Zookeeper, once this is in a usable shape then I think perhaps it's more or less equivalent to Katta. I have a patch in my queue that adds direct Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many options to push index updates to distributed indexes. We just need to offer the right API to implement the integration, and the current API is IMHO quite close. Are there any projects going on that are live indexing systems like solr, yet are backed up by hadoop HDFS like katta? 
There is the Bailey.sf.net project that fits this description, but it's dormant - either it was too early, or there were just too many design questions (or simply the committers moved to other things). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Upgrade to hadoop 0.20?
On Wed, Jul 8, 2009 at 11:13, Julien Nioche lists.digitalpeb...@gmail.comwrote: Good idea. OK, it turns out that we can't :D. MapFileOutputFormat (which we use heavily) is not yet upgraded to hadoop 0.20. We can start using hadoop 0.20 but since we have to use old deprecated APIs, it doesn't make much sense to me. 2009/7/8 Doğacan Güney doga...@gmail.com Hey list, Does anyone have any objections to upgrading to hadoop 0.20? As you may know, they have completely overhauled the MapReduce API(they still keep old API around but it is deprecated). There is a lot of mundane work to do to change all our MR code to new API but I can do that. So what do you guys think? -- Doğacan Güney -- DigitalPebble Ltd http://www.digitalpebble.com -- Doğacan Güney
Upgrade to hadoop 0.20?
Hey list, Does anyone have any objections to upgrading to hadoop 0.20? As you may know, they have completely overhauled the MapReduce API(they still keep old API around but it is deprecated). There is a lot of mundane work to do to change all our MR code to new API but I can do that. So what do you guys think? -- Doğacan Güney
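For anyone who hasn't looked at 0.20 yet, the change is roughly the following (a generic word-count-style mapper, not Nutch code):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Old (pre-0.20) API, now deprecated:
    class OldStyleMapper extends MapReduceBase
        implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, LongWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        output.collect(value, new LongWritable(1));
      }
    }

    // New org.apache.hadoop.mapreduce API introduced in 0.20:
    class NewStyleMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.write(value, new LongWritable(1));
      }
    }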
Re: Build failed in Hudson: Nutch-trunk #857
, Time elapsed: 1.983 sec [junit] Running org.apache.nutch.metadata.TestMetadata [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.399 sec [junit] Running org.apache.nutch.metadata.TestSpellCheckedMetadata [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 12.149 sec [junit] Running org.apache.nutch.net.TestURLFilters [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.449 sec [junit] Running org.apache.nutch.net.TestURLNormalizers [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.933 sec [junit] Running org.apache.nutch.ontology.TestOntologyFactory [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.498 sec [junit] Running org.apache.nutch.parse.TestOutlinkExtractor [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.306 sec [junit] Running org.apache.nutch.parse.TestParseData [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.289 sec [junit] Running org.apache.nutch.parse.TestParseText [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.375 sec [junit] Running org.apache.nutch.parse.TestParserFactory [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.258 sec [junit] Running org.apache.nutch.plugin.TestPluginSystem [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 2.139 sec [junit] Running org.apache.nutch.protocol.TestContent [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.506 sec [junit] Running org.apache.nutch.protocol.TestProtocolFactory [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.206 sec [junit] Running org.apache.nutch.searcher.TestHitDetails [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.284 sec [junit] Running org.apache.nutch.searcher.TestOpenSearchServlet [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.317 sec [junit] Running org.apache.nutch.searcher.TestQuery [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.042 sec [junit] Running org.apache.nutch.searcher.TestSummarizerFactory [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.905 sec [junit] Running org.apache.nutch.searcher.TestSummary [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.379 sec [junit] Running org.apache.nutch.util.TestEncodingDetector [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.627 sec [junit] Running org.apache.nutch.util.TestGZIPUtils [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.43 sec [junit] Running org.apache.nutch.util.TestNodeWalker [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.869 sec [junit] Test org.apache.nutch.util.TestNodeWalker FAILED [junit] Running org.apache.nutch.util.TestPrefixStringMatcher [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.346 sec [junit] Running org.apache.nutch.util.TestStringUtil [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.285 sec [junit] Running org.apache.nutch.util.TestSuffixStringMatcher [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.362 sec [junit] Running org.apache.nutch.util.TestURLUtil [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.743 sec BUILD FAILED http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build.xml:304: Tests failed! Total time: 5 minutes 37 seconds Publishing Javadoc Recording test results -- Doğacan Güney
Why does TestNodeWalker keep failing?
Hi all, Does anyone know why TestNodeWalker keeps failing for the last couple of days? I can reproduce the error in my computer; test log looks like this: Testsuite: org.apache.nutch.util.TestNodeWalker Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 1.101 sec - Standard Error - java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241) at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source) at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source) at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source) at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source) at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source) at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:63) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at junit.framework.TestCase.runTest(TestCase.java:154) at junit.framework.TestCase.runBare(TestCase.java:127) at junit.framework.TestResult$1.protect(TestResult.java:106) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.framework.TestResult.run(TestResult.java:109) at junit.framework.TestCase.run(TestCase.java:118) at junit.framework.TestSuite.runTest(TestSuite.java:208) at junit.framework.TestSuite.run(TestSuite.java:203) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766) - --- Testcase: testSkipChildren took 1.095 sec FAILED UL Content can NOT be found in the node junit.framework.AssertionFailedError: UL Content can NOT be found in the node at org.apache.nutch.util.TestNodeWalker.testSkipChildren(TestNodeWalker.java:79) I have no idea why we get a 503 there? -- Doğacan Güney
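One possible local workaround (untested against the actual test, so treat it as a sketch): give the Xerces parser an EntityResolver that resolves external DTDs to an empty stream, so the test no longer depends on fetching xhtml1-strict.dtd from w3.org:

    import java.io.StringReader;
    import org.apache.xerces.parsers.DOMParser;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    class OfflineParserFactory {
      /** Returns a DOM parser that never fetches external DTDs over the network. */
      static DOMParser newOfflineParser() {
        DOMParser parser = new DOMParser();
        parser.setEntityResolver(new EntityResolver() {
          public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new StringReader("")); // resolve DTDs to nothing
          }
        });
        return parser;
      }
    }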
Re: IOException in dedup
On Tue, Jun 2, 2009 at 20:13, Nic M nicde...@gmail.com wrote: On Jun 2, 2009, at 12:41 PM, Ken Krugler wrote: Hello, I am new with Nutch and I have set up Nutch 0.9 on Easy Eclipse for Mac OS X. When I try to start crawling I get the following exception: Dedup: starting Dedup: adding indexes in: crawl/indexes Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) Does anyone know how to solve this problem? You may be running into this problem: https://issues.apache.org/jira/browse/NUTCH-525 I suggest trying updating to 1.0 or applying the patch there. You can get an IOException reported by Hadoop when the root cause is that you've run out of memory. Normally the hadoop.log file would have the OOM exception. If you're running from inside of Eclipse, see http://wiki.apache.org/nutch/RunNutchInEclipse0.9 for more details. -- Ken -- Ken Krugler +1 530-210-6378 Thank you for the pointers Ken. I changed the VM memory parameters as shown at http://wiki.apache.org/nutch/RunNutchInEclipse0.9. However, I still get the exception and in Hadoop log I have the following exception 2009-06-02 13:08:18,790 INFO indexer.DeleteDuplicates - Dedup: starting 2009-06-02 13:08:18,817 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes 2009-06-02 13:08:19,064 WARN mapred.LocalJobRunner - job_7izmuc java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113) at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176) at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126) I am running Lucene 2.1.0. Any idea why I am getting the ArrayIndexOutofBoundsEception? Nic -- Doğacan Güney
Re: Infinite loop bug in Nutch 0.9
On Wed, Apr 1, 2009 at 13:29, George Herlin ghher...@gmail.com wrote: Sorry, forgot to say, there is an added precondition to causing the bug: the redirection has to be fetched before the page it redirects to... if not, there will be a pre-existing crawl datum with a reasonable refetch-interval. Maybe this is something fixed between 0.9 and 1.0, but I think CrawlDbReducer fixes these datums, around line 147 (case CrawlDatum.STATUS_LINKED). Have you ever gotten stuck in an infinite loop because of it? 2009/4/1 George Herlin ghher...@gmail.com Hello there. I believe I may have found an infinite loop in Nutch 0.9. It happens when a site has a page that refers to itself through a redirection. The code in Fetcher.run(), around line 200 - sorry, my Fetcher has been a little modified, line numbers may vary a little - says, for that case: output(url, new CrawlDatum(), null, null, CrawlDatum.STATUS_LINKED); What that does is insert an extra (empty) crawl datum for the new url, with a re-fetch interval of 0.0. However (see Generator.Selector.map(), particularly lines 144-145), the non-refetch condition used seems to be last-fetch + refetch-interval > now ... which is always false if refetch-interval == 0.0! Now, if there is a new link to the new url in that page, that crawl datum is re-used, and the whole thing loops indefinitely. I've fixed that for myself by changing the quoted line (twice) to: output(url, new CrawlDatum(CrawlDatum.STATUS_LINKED, 30f), null, null, CrawlDatum.STATUS_LINKED); and that works (btw, the 30f should really be the value of db.default.fetch.interval, but I haven't the time right now to work out the issues). In reality the default constructor and the appropriate updater method should, if I am right in analysing the algorithm, always enforce a positive refetch interval. Of course, another method could be used to remove this self-reference, but that could be complicated, as that may happen through a loop (2 or more pages etc..., you know what I mean). Has that been fixed already, and by what method? Best regards George Herlin -- Doğacan Güney
Re: [VOTE] Release Apache Nutch 1.0
So anyone else? Anyone? On Wed, Mar 25, 2009 at 17:17, Dennis Kubes ku...@apache.org wrote: +1, is this binding? :) Doğacan Güney wrote: Another non-binding +1 from me. Hope this one is a keeper :D On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the third release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc2/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/ The following issues that were discovered during the review of the last rc have been fixed: https://issues.apache.org/jira/browse/NUTCH-722 https://issues.apache.org/jira/browse/NUTCH-723 https://issues.apache.org/jira/browse/NUTCH-725 https://issues.apache.org/jira/browse/NUTCH-726 https://issues.apache.org/jira/browse/NUTCH-727 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511 -- Sami Siren -- Doğacan Güney
Re: Announce: New PMC member Dennis Kubes
On Wed, Mar 25, 2009 at 12:24, Andrzej Bialecki a...@getopt.org wrote: Hi all, The Lucene Project Management Committee is happy to announce that Dennis Kubes has been voted in as a new PMC member. He is the third Nutch committer to represent this project there, and his experience and excellent work on Nutch will be also useful in the broader context of the whole Lucene project. Congratulations, Dennis! Congratulations! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Announce: New PMC member Dennis Kubes
Btw, can Dennis be the 3rd +1 that we need so we can finally release 1.0 :D ? On Wed, Mar 25, 2009 at 16:47, Mattmann, Chris A chris.a.mattm...@jpl.nasa.gov wrote: Here here. Deservedly so! Great job, Dennis! Cheers, Chris On 3/25/09 3:27 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Wed, Mar 25, 2009 at 11:24 AM, Andrzej Bialecki a...@getopt.org wrote: The Lucene Project Management Committee is happy to announce that Dennis Kubes has been voted in as a new PMC member. Hip, hip, hurray! Congratulations, Dennis! BR, Jukka Zitting ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/http://sunset.usc.edu/%7Emattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Doğacan Güney
Re: [VOTE] Release Apache Nutch 1.0
Another non-binding +1 from me. Hope this one is a keeper :D On Mon, Mar 23, 2009 at 22:28, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the third release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc2/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/ The following issues that were discovered during the review of the last rc have been fixed: https://issues.apache.org/jira/browse/NUTCH-722 https://issues.apache.org/jira/browse/NUTCH-723 https://issues.apache.org/jira/browse/NUTCH-725 https://issues.apache.org/jira/browse/NUTCH-726 https://issues.apache.org/jira/browse/NUTCH-727 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc2/CHANGES.txt?revision=757511 -- Sami Siren -- Doğacan Güney
Re: Problems compiling Nutch in Eclipse
RTF parser is not built by default because the jars it uses has some licensing issues. And it is out of sync with current trunk so it does not even build anymore. This issue may help: https://issues.apache.org/jira/browse/NUTCH-644 On Sat, Mar 21, 2009 at 03:02, Rodrigo Reyes C. rre...@corbitecso.com wrote: Hi I have configured my eclipse project as stated here http://wiki.apache.org/nutch/RunNutchInEclipse0.9 Still, I am getting the following errors: The return type is incompatible with Parser.getParse(Content) RTFParseFactory.java nutch/src/plugin/parse-rtf/src/java/org/apache/nutch/parse/rtf line 52 Java Problem Type mismatch: cannot convert from ParseResult to Parse TestRTFParser.java nutch/src/plugin/parse-rtf/src/test/org/apache/nutch/parse/rtf line 78 Java Problem Any ideas on what could be wrong? I already included both http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/ jars. Thanks in advance -- Rodrigo Reyes C. -- Doğacan Güney
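For reference, the compile errors come from the Parser API change: plugins used to return a single Parse from getParse(Content), while current trunk returns a ParseResult. From memory, adapting a parser looks roughly like the fragment below (the ParseResult.createParseResult signature and the placeholder values should be double-checked against trunk):

    // Old signature, which parse-rtf still implements:
    //   public Parse getParse(Content content) { ... }

    public ParseResult getParse(Content content) {
      // placeholders: a real plugin would extract these from the RTF document
      String text = "";
      String title = "";
      Outlink[] outlinks = new Outlink[0];

      ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title,
                                          outlinks, content.getMetadata());
      return ParseResult.createParseResult(content.getUrl(),
                                           new ParseImpl(text, parseData));
    }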
Re: [DISCUSS] contents of nutch release artifact
On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote: Sami Siren wrote: Andrzej Bialecki wrote: How about the following: we build just 2 packages: * binary: this includes only base hadoop libs in lib/ (enough to start a local job, no optional filesystems etc), the *.job and *.war files and scripts. Scripts would check for the presence of the plugins/ dir, and offer an option to create it from *.job. The assumption here is that this should be enough to run a full cycle in local mode, and that people who want to run a distributed cluster will first install a plain Hadoop release, and then just put the *.job and bin/nutch on the master. * source: no build artifacts, no .svn (equivalent to svn export), simple tgz. This sounds good to me. Additionally, some new documentation needs to be written too. I added a simple patch to NUTCH-728 to make a plain source release from svn; what do people think, should we add the plain source package into the next rc? I would not like to make changes to the binary package now but propose that we do those changes post 1.0. +1 for including the plain source release in the next rc. As for the local/distributed separation, it is a good idea, but I think we should hold it for 1.1 (or something else) if it requires architectural changes (and thus needs review and testing). -- Sami Siren -- Doğacan Güney
Re: [DISCUSS] contents of nutch release artifact
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote: (anyway, what's a measly 90MB nowadays .. ;) It's a pretty long download unless you have a fast connection and a nearby mirror. I agree. Can't we also do a source-only release? Kind of like a checkout from svn (without, of course, svn bits)? I think this would be much more interesting to me if I wasn't using trunk. So, my suggestion is that we have 3 releases? Source only, binary only and full. BR, Jukka Zitting -- Doğacan Güney
Re: [VOTE] Release Apache Nutch 1.0
Again, my non-binding +1 :) On 10.Mar.2009, at 09:34, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the second release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc1/ See the CHANGES.txt[1] file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/?pathrev=752004 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Here's my +1 Thanks! [1] http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc1/CHANGES.txt?view=logpathrev=752004 -- Sami Siren
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) -- Sami Siren
Re: NUTCH-684 [was: Re: [VOTE] Release Apache Nutch 1.0]
On Mon, Mar 9, 2009 at 17:46, Sami Siren ssi...@gmail.com wrote: Doğacan Güney wrote: On 09.Mar.2009, at 11:05, Sami Siren ssi...@gmail.com mailto:ssi...@gmail.com wrote: Doğacan Güney wrote: On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com mailto:ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ http://people.apache.org/%7Esiren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. uh, I missed that one, sorry. Do you think it's ready to be included? (IMO that's an important feature) It's not a big deal for me to rebuild the package with that feature included. I only tested it on a small crawl. Still, I believe it is important too so I would like to include it. Worst case we release a 1.0.1 soon after:) I am fine either way. So if you think it's good enough to go in just commit it and I'll build another rc. If not then we can release it later too when it's ready. Committed, thanks for waiting :) -- Sami Siren -- Sami Siren -- Doğacan Güney
Re: [VOTE] Release Apache Nutch 1.0
On Sun, Mar 8, 2009 at 20:25, Sami Siren ssi...@gmail.com wrote: Hello, I have packaged the first release candidate for Apache Nutch 1.0 release at http://people.apache.org/~siren/nutch-1.0/rc0/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made from tag: http://svn.apache.org/viewvc/lucene/nutch/tags/release-1.0-rc0/?pathrev=751480 Please vote on releasing this package as Apache Nutch 1.0. The vote is open for the next 72 hours. Only votes from Lucene PMC members are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 1.0 [ ] -1 Do not release the packages because... Thanks! That's great! I would like to see NUTCH-684 in but I guess I was too late :) Anyway, my non-binding +1. -- Sami Siren -- Doğacan Güney
Re: Release 1.0?
On Sat, Feb 28, 2009 at 10:00, Sami Siren ssi...@gmail.com wrote: dealmaker wrote: Hi, Is there going to be a delay of the 1.0 release? Today is almost Feb 28. You said that 1.0 will come in Feb. I am customizing Nutch 0.9, and I am wondering if I should wait couple more days for the 1.0 release. I think that no one else but me made any guesses about the release date? (since it is virtually impossible due to fact that this is not a paid project). The general consensus seems to be that we should get the next release out preferably sooner than later. I personally still think that the first release candidate is not that far away - we have no blocker issues left and it seems (judged by the lack of activity on working with those remaining issues) that the ones still there are not too important. I am going to commit NUTCH-669 soon and after that I am fine with starting the release process. Other devs might have different opinions. +1. I will finish solr dedup tomorrow, after that I have no more issues I want to address before 1.0. -- Sami Siren -- Sami Siren Thanks. Andrzej Bialecki wrote: Marko Bauhardt wrote: Hi, is there anybody out there? ;) exists a plan when version 1.0 will be released? thanks marko On Jan 28, 2009, at 9:45 AM, Marko Bauhardt wrote: Hi all, is there a timeline for the release 1.0? Currently it exists 33 issues (9 Bugs). Is there a plan for a feature freeze? Maybe some big issues can be moved to version 1.1? We do exist. ;) We plan to release in February - I can't tell you yet when exactly, we need to review the (few) remaining issues that we want to resolve before the release. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Nutch ScoringFilter plugin problems
On Mon, Jan 26, 2009 at 2:17 PM, Pau pau...@gmail.com wrote: Hello, I still have the same problem. I have the following piece of code: if (linkdb == null) { System.out.println("Null linkdb"); } else { System.out.println("LinkDB not null"); } Inlinks inlinks = linkdb.getInlinks(url); System.out.println("a"); On the output I can see it always prints "LinkDB not null", so linkdb is not null. But "a" never gets printed, so I guess that at: Inlinks inlinks = linkdb.getInlinks(url); there is some error. Maybe the getInlinks function throws an IOException? I do catch the IOException, but the catch block is never executed either. It is very difficult to guess without seeing the exception. Maybe you can try catching everything (i.e. Throwable) and printing it? One question, how should I create the LinkDbReader? I do it the following way: linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb")); Is it right? Thanks. On Wed, Jan 21, 2009 at 10:16 AM, Pau pau...@gmail.com wrote: Ok, I think you are right, maybe inlinks is null. I will try it now. Thank you! I have no information about the exception. It seems that simply the program skips this part of the code... maybe a ScoringFilterException is thrown? On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney doga...@gmail.com wrote: On Tue, Jan 20, 2009 at 7:18 PM, Pau pau...@gmail.com wrote: Hello, I want to create a new ScoringFilter plugin. In order to evaluate how interesting a web page is, I need information about the link structure in the LinkDB. In the method updateDBScore, I have the following lines (among others): 88: linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb")); ... 99: System.out.println("Inlinks to " + url); 100: Inlinks inlinks = linkdb.getInlinks(url); 101: System.out.println("a"); 102: Iterator<Inlink> iIt = inlinks.iterator(); 103: System.out.println("b"); "a" always gets printed, but "b" rarely gets printed, so it seems that in line 102 an error happens and an exception is raised. Do you know why this is happening? What am I doing wrong? Thanks. Maybe there are no inlinks to that page so inlinks is null? What is the exception exactly? -- Doğacan Güney -- Doğacan Güney
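A minimal sketch of the catch-everything debugging suggested above (the class name and linkdb path are placeholders; url and conf would come from the enclosing ScoringFilter, this is not the actual plugin code):

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;

class InlinkDebug {
  // Dump the inlinks of a URL, catching *everything* so the real cause
  // shows up in the task log instead of the block being skipped silently.
  static void dumpInlinks(Configuration conf, Text url) {
    try {
      LinkDbReader linkdb = new LinkDbReader(conf, new Path("crawl/linkdb"));
      Inlinks inlinks = linkdb.getInlinks(url);
      if (inlinks == null) {
        System.out.println("no inlinks recorded for " + url);
        return;
      }
      Iterator<Inlink> it = inlinks.iterator();
      while (it.hasNext()) {
        System.out.println("inlink: " + it.next().getFromUrl());
      }
    } catch (Throwable t) {
      // IOException, runtime exception or Error -- print whatever it is.
      t.printStackTrace();
    }
  }
}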
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
So, is it OK to remove pmd-ext directory for now? It is not clear if we need it when we have the infrastructure but we don't have the infrastructure now anyway :D. So, I suggest that we remove it for now (and we trim 2.2MB ), and add it back after 1.0 and actually use it. Is everyone OK with this? On Wed, Jan 21, 2009 at 12:01 AM, Piotr Kosiorowski pkosiorow...@gmail.com wrote: I have configured hudson for 10 or more projects and always used pmd plugin to display the pmd results only - the actual pmd task to generate report was run from ant script. Maybe there is such possibility tu run pmd reports directly in hudson (not through project build scripts) but I have never come accross it. Piotr On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote: They've had pmd integrated with Hudson for many months now, I believe. I've seen patches in JIRA that were the result of fixes for problems reported by pmd. Or maybe they run pmd by hand? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 3:40:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic wrote: That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/ But who knows, maybe maven/ivy fetch them on demand. I don't know. Hmm, does 0.19 use ivy(0.19 also doesn't have pmd)? http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/ Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 1:13:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote: Lucene doesn't use anything. Hadoop uses pmd integrate in Hudson. Does this mean we do not need pmd jars in nutch ( are they provided by hudson)? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 10:49:44 AM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions 2009/1/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them long time ago in an attempt to bring some static analysis toools to nutch sources. There was a short discussion around it and we all thought t was worth doing but it never gained enough momentum. There is a pmd target in build.xml file that uses it - they are not needed in runtime nor for standard builds. As nutch is built using hudson now I think it would be worth to integrate pmd (and checkstyle/findbugs/cobertura might be also interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I found it valuable. Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? or do they use something at all? But as I am not active committer now (I only try to follow mailing lists) I do not think it is my call. But if everyone will be interested I can try to look at integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention). Congratulations! 
Piotr On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote: Update external jars to latest versions --- Key: NUTCH-680 URL: https://issues.apache.org/jira/browse/NUTCH-680 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 This issue will be used to update external libraries nutch uses. These are the libraries that are outdated (upon a quick glance): nekohtml (1.9.9) lucene-highlighter (2.4.0) jdom (1.1) carrot2 - as mentioned in another issue jets3t - above icu4j (4.0.1) jakarta-oro (2.0.8) We should probably update tika to whatever the latest is as well before 1.0. Please add ones I missed in comments. Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan
Re: Nutch ScoringFilter plugin problems
On Tue, Jan 20, 2009 at 7:18 PM, Pau pau...@gmail.com wrote: Hello, I want to create a new ScoringFilter plugin. In order to evaluate how interesting a web page is, I need information about the link structure in the LinkDB. In the method updateDBScore, I have the following lines (among others): 88: linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb")); ... 99: System.out.println("Inlinks to " + url); 100: Inlinks inlinks = linkdb.getInlinks(url); 101: System.out.println("a"); 102: Iterator<Inlink> iIt = inlinks.iterator(); 103: System.out.println("b"); "a" always gets printed, but "b" rarely gets printed, so it seems that in line 102 an error happens and an exception is raised. Do you know why this is happening? What am I doing wrong? Thanks. Maybe there are no inlinks to that page so inlinks is null? What is the exception exactly? -- Doğacan Güney
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
2009/1/20 Piotr Kosiorowski pkosiorow...@gmail.com: pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them long time ago in an attempt to bring some static analysis toools to nutch sources. There was a short discussion around it and we all thought t was worth doing but it never gained enough momentum. There is a pmd target in build.xml file that uses it - they are not needed in runtime nor for standard builds. As nutch is built using hudson now I think it would be worth to integrate pmd (and checkstyle/findbugs/cobertura might be also interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I found it valuable. Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? or do they use something at all? But as I am not active committer now (I only try to follow mailing lists) I do not think it is my call. But if everyone will be interested I can try to look at integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention). Congratulations! Piotr On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote: Update external jars to latest versions --- Key: NUTCH-680 URL: https://issues.apache.org/jira/browse/NUTCH-680 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 This issue will be used to update external libraries nutch uses. These are the libraries that are outdated (upon a quick glance): nekohtml (1.9.9) lucene-highlighter (2.4.0) jdom (1.1) carrot2 - as mentioned in another issue jets3t - above icu4j (4.0.1) jakarta-oro (2.0.8) We should probably update tika to whatever the latest is as well before 1.0. Please add ones I missed in comments. Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan Güney
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote: Lucene doesn't use anything. Hadoop uses pmd integrate in Hudson. Does this mean we do not need pmd jars in nutch ( are they provided by hudson)? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 10:49:44 AM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions 2009/1/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them long time ago in an attempt to bring some static analysis toools to nutch sources. There was a short discussion around it and we all thought t was worth doing but it never gained enough momentum. There is a pmd target in build.xml file that uses it - they are not needed in runtime nor for standard builds. As nutch is built using hudson now I think it would be worth to integrate pmd (and checkstyle/findbugs/cobertura might be also interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I found it valuable. Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? or do they use something at all? But as I am not active committer now (I only try to follow mailing lists) I do not think it is my call. But if everyone will be interested I can try to look at integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention). Congratulations! Piotr On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote: Update external jars to latest versions --- Key: NUTCH-680 URL: https://issues.apache.org/jira/browse/NUTCH-680 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 This issue will be used to update external libraries nutch uses. These are the libraries that are outdated (upon a quick glance): nekohtml (1.9.9) lucene-highlighter (2.4.0) jdom (1.1) carrot2 - as mentioned in another issue jets3t - above icu4j (4.0.1) jakarta-oro (2.0.8) We should probably update tika to whatever the latest is as well before 1.0. Please add ones I missed in comments. Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan Güney -- Doğacan Güney
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote: That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/ But who knows, maybe maven/ivy fetch them on demand. I don't know. Hmm, does 0.19 use ivy(0.19 also doesn't have pmd)? http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/ Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney doga...@gmail.com To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 1:13:20 PM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote: Lucene doesn't use anything. Hadoop uses pmd integrate in Hudson. Does this mean we do not need pmd jars in nutch ( are they provided by hudson)? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Doğacan Güney To: nutch-dev@lucene.apache.org Sent: Tuesday, January 20, 2009 10:49:44 AM Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions 2009/1/20 Piotr Kosiorowski : pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I have committed them long time ago in an attempt to bring some static analysis toools to nutch sources. There was a short discussion around it and we all thought t was worth doing but it never gained enough momentum. There is a pmd target in build.xml file that uses it - they are not needed in runtime nor for standard builds. As nutch is built using hudson now I think it would be worth to integrate pmd (and checkstyle/findbugs/cobertura might be also interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I found it valuable. Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? or do they use something at all? But as I am not active committer now (I only try to follow mailing lists) I do not think it is my call. But if everyone will be interested I can try to look at integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention). Congratulations! Piotr On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote: Update external jars to latest versions --- Key: NUTCH-680 URL: https://issues.apache.org/jira/browse/NUTCH-680 Project: Nutch Issue Type: Improvement Reporter: Doğacan Güney Assignee: Doğacan Güney Priority: Minor Fix For: 1.0.0 This issue will be used to update external libraries nutch uses. These are the libraries that are outdated (upon a quick glance): nekohtml (1.9.9) lucene-highlighter (2.4.0) jdom (1.1) carrot2 - as mentioned in another issue jets3t - above icu4j (4.0.1) jakarta-oro (2.0.8) We should probably update tika to whatever the latest is as well before 1.0. Please add ones I missed in comments. Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan Güney -- Doğacan Güney -- Doğacan Güney
Re: RSS-fecter and index individul-how can i realize this function
On Mon, Jan 5, 2009 at 7:00 AM, Vlad Cananau vlad...@gmail.com wrote: Hello, I'm trying to make RSSParser do something similar to FeedParser (which doesn't work quite right) - that is, instead of indexing the whole contents Why doesn't FeedParser work? Let's fix whatever is broken in it :D of the feed, I want it to show individual items, with their respective title and a proper link to the article. I realize that I could index 1 depth more, but I'd like to index just the feed, not the articles that go with it (keep the index small and the crawl fast). For each item in each RSS channel (the code does not differ much for getParse() of RSSParser.java) I do something like Outlink[] outlinks = new Outlink[1]; try { outlinks[0] = new Outlink(whichLink, theRSSItem.getTitle()); } catch (Exception e) { continue; } parseResult.put( whichLink, new ParseText(theRSSItem.getTitle() + theRSSItem.getDescription()), new ParseData( ParseStatus.STATUS_SUCCESS, theRSSItem.getTitle(), outlinks, new Metadata() //was content.getMetadata() ) ); The problem is, however, that only one item from the whole RSS gets into the index, although in the log I can see them all (I've tried it with feeds from cnn and reuters). What happens? Why do they get overwritten in a seemingly random order? The item that makes it into the index is neither the first nor the last, but appears to be the same until new items appear in the feed. Thank you, Vlad -- Doğacan Güney
Re: readlinkdb fails to dump linkdb
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm [EMAIL PROTECTED] wrote: On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney [EMAIL PROTECTED] wrote: On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote: Using nutch 0.9 (hadoop 0.17.1): [EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt LinkDb dump: starting LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb It seems you are providing a crawldb as argument. You should pass the linkdb. Thanks a lot for the hint, but I cannot find linkdb dir anywhere on the HDFS :_/ Can you point me where should it be ? A linkdb is created with the command: invertlinks, e.g: bin/nutch invertlinks crawl/linkdb crawl/segments/ java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427) at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) LinkDbReader: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062) at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110) at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114) This is the first time I use readlinkdb and the rest of the crawling process is working ok, I've looked up JIRA and there's no related bug. I've also tried latest trunk nutch but DFS is not working for me: [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls Exception in thread main java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118) at org.apache.hadoop.fs.FsShell.init(FsShell.java:88) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646) ... 10 more Should I file both bugs on JIRA ? This I am not sure, but did you try ant clean; ant? It may be a version mismatch. 
Yes, I did ant clean; ant before trying the above command. I also tried to upgrade the filesystem unsuccessfully and even created it from scratch: https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12650556#action_12650556 -- Doğacan Güney -- Doğacan Güney
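For reference, the sequence being described looks roughly like this, assuming the crawl data lives under crawl/ (adjust the paths to your own layout): first build the linkdb with invertlinks, then dump it. readlinkdb expects the linkdb directory itself; pointing it at the crawldb is what produces the Inlinks/CrawlDatum type mismatch above.

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch readlinkdb crawl/linkdb -dump linkdb-dump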
Re: readlinkdb fails to dump linkdb
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm [EMAIL PROTECTED] wrote: Using nutch 0.9 (hadoop 0.17.1): [EMAIL PROTECTED] working]$ bin/nutch readlinkdb /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt LinkDb dump: starting LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb It seems you are providing a crawldb as argument. You should pass the linkdb. java.io.IOException: Type mismatch in value from map: expected org.apache.nutch.crawl.Inlinks, recieved org.apache.nutch.crawl.CrawlDatum at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427) at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124) LinkDbReader: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062) at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110) at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114) This is the first time I use readlinkdb and the rest of the crawling process is working ok, I've looked up JIRA and there's no related bug. I've also tried latest trunk nutch but DFS is not working for me: [EMAIL PROTECTED] trunk]$ bin/hadoop dfs -ls Exception in thread main java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118) at org.apache.hadoop.fs.FsShell.init(FsShell.java:88) at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hdfs.DistributedFileSystem at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628) at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646) ... 10 more Should I file both bugs on JIRA ? This I am not sure, but did you try ant clean; ant? It may be a version mismatch. -- Doğacan Güney
Re: Pending Commits for Nutch Issues
Hi Dennis, On Wed, Nov 26, 2008 at 11:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote: If nobody has a problem with them I would like to commit the following issues in the next day or two: NUTCH-663: Upgrade Nutch to the most recent Hadoop version (0.19) NUTCH-662: Upgrade Nutch to the most recent Lucene version (2.4) NUTCH-647: Resolve URLs tool NUTCH-665: Search Load Testing Tool NUTCH-667: Input Format for working with Content in Hadoop Streaming And I would like to commit these in a week: NUTCH-635: LinkAnalysis Tool for Nutch NUTCH-646: New Indexing framework for Nutch NUTCH-594: Serve Nutch search results in XML and JSON NUTCH-666: Analysis plugins and new language identifier. There are others too but these are the ones I am trying to get moved into trunk right now. I am OK with all but NUTCH-666... Why a new language identifier? (or if a new one, why keep old one around?) Dennis -- Doğacan Güney
Re: NUTCH-92
Hi, On Wed, Nov 26, 2008 at 3:04 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi all, After reading this paper: http://wortschatz.uni-leipzig.de/~fwitschel/papers/ipm1152.pdf I came up with the following idea of implementing global IDF in Nutch. The upside of the approach I propose is that it brings back the cost of making a search query to 1 RPC call. The downside is that the search servers need to cache global IDF estimates as computed by the DS.Client, which ties them to a single query front-end (DistributedSearch.Client), or requires keeping a map of client, globalIDFs on each search server. - First, as the paper above claims, we don't really need exact IDF values of all terms from every index. We should get acceptable quality if we only learn the top-N frequent terms, and for the rest of them we apply a smoothing function that is based on global characteristics of each index (such as the number of terms in the index). This means that the data that needs to be collected by the query integrator (DS.Client in Nutch) from shard servers (DS.Server in Nutch) would consist of a list of e.g. top 500 local terms with their frequency, plus the local smoothing factor as a single value. We could further reduce the amount of data to be sent from/to shard servers by encoding this information in a counted Bloom filter with a single-byte resolution (or a spectral Bloom filter, whichever yields a better precision / bit in our case). The query integrator would ask all active shard servers to provide their local IDF data, and it would compute global IDFs for these terms, plus a global smoothing factor, and send back the updated information to each shard server. This would happen once per lifetime of a local shard, and is needed because of the local query rewriting (and expansion of terms from Nutch Query to Lucene Query). Shard servers would then process incoming queries using the IDF estimates for terms included in the global IDF data, or the global smoothing factors for terms missing from that data (or use local IDFs). The global IDF data would have to be recomputed each time the set of shards available to a DS.Client changes, and then it needs to be broadcast back from the client to all servers - which is the downside of this solution, because servers need to keep a cache of this information for every DS.Client (each of them possibly having a different list of shard servers, hence different IDFs). Also, as shard servers come and go, the IDF data keeps being recomputed and broadcast, which increases the traffic between the client and servers. Still I believe the amount of additional traffic should be minimal in a typical scenario, where changes to the shards are much less frequent than the frequency of sending user queries. :) -- Now, if this approach seems viable (please comment on this), what should we do with the patches in NUTCH-92 ? 1. skip them for now, and wait until the above approach is implemented, and pay the penalty of using skewed local IDFs. 2. apply them now, and pay the penalty of additional RPC call / search, and replace this mechanism with the one described above, whenever that becomes available. It seems I wrote the patch in NUTCH-92. My recollection was that you wrote it, Andrzej :D Anyway, I have no idea what I did in that patch, don't know if it works or applies etc. Really, I am just curios. Did anyone test it? Does it really work :) ? I haven't read the paper yet but the proposed approach sounds better to me. Do you have any code ready, Andrzej? 
Or how difficult is it to implement it? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
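To make the combination step a bit more concrete, here is a rough sketch (all class and field names are invented; this is not working Nutch code): exact document frequencies are summed for terms present in a shard's top-N list, and the shard's smoothing estimate is used otherwise.

import java.util.List;
import java.util.Map;

// Per-shard statistics shipped to the query integrator (DS.Client).
class ShardStats {
  Map<String, Long> topTermDf;  // df for the ~500 most frequent local terms
  long smoothedDf;              // fallback estimate derived from index size
  long numDocs;                 // local document count
}

class GlobalIdf {
  // Estimated global document frequency for a term across all shards.
  static long globalDf(String term, List<ShardStats> shards) {
    long df = 0;
    for (ShardStats s : shards) {
      Long local = s.topTermDf.get(term);
      df += (local != null) ? local : s.smoothedDf;
    }
    return df;
  }

  // Classic idf = log(N / df), computed over the union of all shards.
  static double globalIdf(String term, List<ShardStats> shards) {
    long n = 0;
    for (ShardStats s : shards) n += s.numDocs;
    long df = Math.max(1, globalDf(term, shards));
    return Math.log((double) n / df);
  }
}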
Re: Pending Commits for Nutch Issues
And here is a list of issues from me that needs more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to review for people, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in nutch (libraries in plugins included) and update them to latest versions. I am going to open an issue with the intenton of updating all the libraries that do not require code changes. -- Doğacan Güney
Re: Pending Commits for Nutch Issues
I forgot: I think there is a huge bug with MapWritable in nutch. I didn't yet figure out what it is exactly but it has something to do with the fact that id-class maps are static. On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney [EMAIL PROTECTED] wrote: And here is a list of issues from me that needs more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to review for people, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in nutch (libraries in plugins included) and update them to latest versions. I am going to open an issue with the intenton of updating all the libraries that do not require code changes. -- Doğacan Güney -- Doğacan Güney
Re: Pending Commits for Nutch Issues
OK one last thing: Get rid of Fetcher and promote Fetcher2 to be the default fetcher. On Thu, Nov 27, 2008 at 7:15 PM, Doğacan Güney [EMAIL PROTECTED] wrote: I forgot: I think there is a huge bug with MapWritable in nutch. I didn't yet figure out what it is exactly but it has something to do with the fact that id-class maps are static. On Thu, Nov 27, 2008 at 7:10 PM, Doğacan Güney [EMAIL PROTECTED] wrote: And here is a list of issues from me that needs more discussion/review: NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to review for people, for now we can just write a SolrIndexer like Sami Siren's and deal with 442 after 1.0. I would be happy to provide such a patch. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I don't know how to fix this one but indexing almost always fails with index-more enabled. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly: I botched it once so now I am afraid to commit it :D NUTCH-626 - fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects: I am going to update the patch and commit it if no objections. Also, I think NUTCH-658 would be a nice feature for 1.0. There are some others but these are the most recent and we really should push 1.0 out the door already :D Oh and finally we should do a review of all libraries in nutch (libraries in plugins included) and update them to latest versions. I am going to open an issue with the intenton of updating all the libraries that do not require code changes. -- Doğacan Güney -- Doğacan Güney -- Doğacan Güney
Re: NUTCH-92
On Thu, Nov 27, 2008 at 11:40 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: It seems I wrote the patch in NUTCH-92. My recollection was that you wrote it, Andrzej :D No, I didn't - you did! :) I only came up with the proposal, after discussing it with Doug. Anyway, I have no idea what I did in that patch, don't know if it works or applies etc. Really, I am just curios. Did anyone test it? Does it really work :) ? Not me. I shied away from the patch because I didn't like the 2 RPC-s per search. I still don't like it, but I may have to accept it as an interim solution. That was my question, really - for release 1.0: * are we better off not having this patch, and just be careful how we split indexes among searchers as we do it now, or * should we apply the patch, pay the price of 2 RPCs, and wait for the patch implementing the approach that I proposed? * or make an effort to implement the new approach, and postpone the release until this is ready. 3rd approach sounds the best, especially if new approach is not difficult to implement. (I may even give it a try if I have the time) I haven't read the paper yet but the proposed approach sounds better to me. Do you have any code ready, Andrzej? Or how difficult is it to implement it? No code yet, just thinking aloud. But it's not really anything complicated, chunks of code already exist that implement almost all building blocks of the algorithm. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: 1.0 Release?
I agree with this list and have nothing new to add. (Except, I guess people also want NUTCH-92 to be fixed) On Thu, Nov 20, 2008 at 6:51 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Dennis Kubes wrote: What does everybody think of trying to do a Nutch 1.0 release in the next couple of weeks. I have 8 different patches that are ready to be committed including: 1) NUTCH-647: Resolve URLs tool 2) NUTCH-635: LinkAnalysis Tool for Nutch 3) NUTCH-646: New Indexing framework for Nutch 4) NUTCH-594: Serve Nutch search results in XML and JSON 5) Custom fields on index and plugins 6) Upgrade Nutch to the most recent Hadoop version (18.2). 7) Upgrade Nutch to the most recent Lucene version (2.4). 8) Analysis plugins and improvments to analyzer factory for multiple languages per analysis plugin. Language identifier. I am going to try to get those posted in the next couple of days and committed in the next week. Are there other major improvements we want to put in before trying to do a 1.0 release for Nutch? Thoughts and suggestions? A few recently opened ones that should be easy to fix: NUTCH-661errors when the uri contains space characters NUTCH-657Estonian N-gram profile has wrong name NUTCH-652AdaptiveFetchSchedule#setFetchSchedule doesn't calculate fetch interval correctly NUTCH-644RTF parser doesn't compile anymore NUTCH-643ClassCastException in PdfParser on encrypted PDF with empty password NUTCH-636Http client plug-in https doesn't work on IBM JRE NUTCH-631MoreIndexingFilter fails with NoSuchElementException NUTCH-626fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects NUTCH-566Sun's URL class has bug in creation of relative query URLs NUTCH-542Null Pointer Exception on getSummary when segment no longer exists NUTCH-531Pages with no ContentType cause a Null Pointer exception And of course this one: NUTCH-442Integrate Solr/Nutch We should also review all other open issues marked as Blocker / Major, especially those with patches, and take some action - either fix them, or won't fix 'em, or postpone to the next release (the single Blocker issue should be fixed). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Help needed in Integrating a module
On Sat, Sep 27, 2008 at 10:32 PM, Nimesh Priyodit [EMAIL PROTECTED] wrote: Hi, Recently i have developed my own stemmer. Can you please tell me how to integrate the module which i wrote, into nutch? Where exactly do you want to integrate it? Into indexing? Regards, Nimesh -- Doğacan Güney
Re: Crawled documents in readable format
On Sat, Sep 27, 2008 at 9:24 PM, Allan Avendaño [EMAIL PROTECTED] wrote: Hi to all! I would like to get the nutch crawled documents in readable format. How could I do that? You can try the readseg tool, i.e. bin/nutch readseg -dump <segment> <output> -nofetch -noparse -nogenerate -noparsedata -nocontent This will give you the parsed text of the segments. Thanks for ur help -- Allan Roberto Avendaño Sudario Guayaquil-Ecuador Home : +593(4) 2800 692 Office : +593(4) 2269 268 + MSN-Messenger: [EMAIL PROTECTED] + Gmail: [EMAIL PROTECTED] -- Doğacan Güney
Re: Droids crawler
On Fri, Sep 12, 2008 at 5:38 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Interesting. Worth a deeper look I think. I think one of the keys to a new version of nutch would be crawler extensibility. I agree. So let's start a discussion then. What is missing from nutch's crawler? What does droids do that we don't? Dennis Andrzej Bialecki wrote: Hi all, In the light of discussion about the future of Nutch I'd lie to draw your attention to Droids - a small crawler framework that uses Spring for extensibility. http://people.apache.org/~thorsten/droids/ Are there any lessons there that we could learn? -- Doğacan Güney
Re: Next release?
Hi, On Tue, Feb 19, 2008 at 11:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi all, I propose to start planning for the next release, and tentatively I propose to schedule it for the beginning of April. I'm going to close a lot of old and outdated issues in JIRA - other committers, please do the same if you know that a given issue no longer applies. There are some issues I want to put in before a release. Most are trivial but I would like to draw attention to NUTCH-442, as it is an issue that I (and looking at its votes, others) want to see resolved before another release. I really could use some review and suggestions there (well, I guess I am partly to blame since I failed to update the patch after Enis's comments). Out of the remaining open issues, we should resolve all with the blocker / major status, and of the type bug. Then we can resolve as many as we can from the remaining categories, depending on the votes and perceived importance of the issue. Any other suggestions? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Backwards compatibility strategy
Hi, On Nov 22, 2007 7:45 PM, Sami Siren [EMAIL PROTECTED] wrote: Hello all, Currently there are many places in Nutch that try to handle older formats of serialized data. This (at least in the longer run) will make the code harder to understand, harder to test and harder to maintain. IMO it would be cleaner to offer conversion with separate tools (like CrawlDbConverter) and keep the rest of the code clean from such functionality. Opinions? I disagree. Posts on nutch-user show that people are confused when we break compatibility. If backward compatibility code within other code is getting messy, then we can use conversion tools, but they should be transparent to the regular user. For example, before a nutch job runs, a small program can check whether any conversion needs to be applied (this program can check compatibility by reading a few records of a segment), then print a warning, run the conversion job first, and then run the requested job. I personally favor starting from scratch when switching versions, but probably there are users who wish to convert older data, or are there? -- Sami Siren -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #261
] symbol : constructor Outlink(java.lang.String,java.lang.String,org.apache.hadoop.conf.Configuration) [javac] location: class org.apache.nutch.parse.Outlink [javac] outlinks[i] = new Outlink(http://outlink.com/; + i, Outlink + i, conf); [javac] ^ [javac] http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/util/TestFibonacciHeap.java :65: cannot find symbol [javac] symbol : class FibonacciHeap [javac] location: class org.apache.nutch.util.TestFibonacciHeap [javac] FibonacciHeap h= new FibonacciHeap(); [javac] ^ [javac] http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/util/TestFibonacciHeap.java :65: cannot find symbol [javac] symbol : class FibonacciHeap [javac] location: class org.apache.nutch.util.TestFibonacciHeap [javac] FibonacciHeap h= new FibonacciHeap(); [javac] ^ [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java uses or overrides a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. [javac] 5 errors BUILD FAILED http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml :217: Compile failed; see the compiler error output for details. Total time: 52 seconds Publishing Javadoc Recording test results No test report files were found. Configuration error? Updating NUTCH-548 Updating NUTCH-494 Updating NUTCH-547 Updating NUTCH-538 -- Doğacan Güney
Re: JIRA emails and Nutch
On Nov 4, 2007 8:36 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Dennis Kubes wrote: I don't think JIRA emails are being sent out for Nutch. Changes from yesterday and today have yet to be mailed out. Commit emails are being mailed. Is this something that we send to infrastructure? I think so. Speaking of which, I noticed that I also stopped getting commit messages, which is double strange ... I'll try to subscribe manually and see what happens. Any progress on this? I was thinking of committing/resolving some issues in JIRA but I want to wait until emails start working. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Next move with JIRA ticket
Hi, On 10/31/07, Ned Rockson [EMAIL PROTECTED] wrote: I submitted a JIRA ticket regarding URL ordering in Generator.java as well as a patch (NUTCH-570) and I'm wondering what else I need to do to get this committed. Obviously it's low priority so I may be getting too antsy. Since NUTCH-570 tracks a non-trivial change and nutch development is a bit slow these days, it may be a while before someone can review your patch and make a comment on it. Personally, I have been meaning to take a look at your patch, but I have been too lazy^Wbusy lately. What you can do is that, for example, you can send some statistics regarding overhead of running two extra jobs or fetch performance increase as a result of smarter url ordering. Again personally, I find that patches with such numbers and test cases are a lot easier to review (thus, easier to commit:). -- Doğacan Güney
Re: Adding new class to nutch
Hi, On 10/29/07, eyal edri [EMAIL PROTECTED] wrote: Hi, i'm interested in adding a new class on my own to nutch, to allow a few config needed to our application (such as reading config file, etc.) i've written a new java class called: LabConf.java and placed it in the $NUTCH_HOME/src/java/org/apache/nutch/util dir. after running ant i didnt see any messages indicating that this code was added to the project. can anyone tell me where i need to tell nutch to regard to this new class? New class should be added. By default, ant compiles everything under src/java/org/apache/nutch. thanks -- Eyal Edri -- Doğacan Güney
Re: First Plugin
Hi, On 10/5/07, Sagar Vibhute [EMAIL PROTECTED] wrote: Hi, I have recently downloaded and used nutch and I need to develop a few plugins for my work. I took the plugin example given on the wiki, http://wiki.apache.org/nutch/WritingPluginExample-0%2e9 and followed the instructions as given there. Now when I start crawling again it aborts and throws the following exception: Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.crawl.Injector.inject(Injector.java:162) at org.apache.nutch.crawl.Crawl.main(Crawl.java:115) I could crawl successfully before I added this plugin. Please give any insights you can to get this fixed. (I really should add this to FAQ) This log doesn't help us. This simply tells us that crawling has failed. You have to check your logs elsewhere (logs/hadoop.log directory if you are local and your tasktracker's logs if you are running in distributed mode). If you can send those logs we can make a more informed analysis about your problem. Thank You! - Sagar -- Doğacan Güney
Re: First Plugin
OK, it seems you have removed the scoring-opic plugin (and other scoring plugins, if you had any) by accident. You should check your plugin.includes option in nutch-site.xml; there is probably something wrong with it. Perhaps you put a newline there? - Sagar -- Doğacan Güney
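For illustration, a plugin.includes entry in nutch-site.xml should look roughly like the following; treat the plugin list as a sketch rather than the canonical default for your version, the important points being that scoring-opic is present and that the value contains no line breaks:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regexp of plugin ids to include; note scoring-opic and the single-line value.</description>
</property>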
Re: Build failed in Hudson: Nutch-Nightly #221
org.apache.nutch.metadata.TestMetadata [junit] Tests run: 10, Failures: 0, Errors: 0, Time elapsed: 0.364 sec [junit] Running org.apache.nutch.metadata.TestSpellCheckedMetadata [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 13.652 sec [junit] Running org.apache.nutch.net.TestURLFilters [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.91 sec [junit] Running org.apache.nutch.net.TestURLNormalizers [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.909 sec [junit] Running org.apache.nutch.ontology.TestOntologyFactory [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.262 sec [junit] Running org.apache.nutch.parse.TestOutlinkExtractor [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.972 sec [junit] Running org.apache.nutch.parse.TestParseData [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.393 sec [junit] Running org.apache.nutch.parse.TestParseText [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.348 sec [junit] Running org.apache.nutch.parse.TestParserFactory [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.376 sec [junit] Running org.apache.nutch.plugin.TestPluginSystem [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 2.486 sec [junit] Running org.apache.nutch.protocol.TestContent [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.602 sec [junit] Running org.apache.nutch.protocol.TestProtocolFactory [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.241 sec [junit] Running org.apache.nutch.searcher.TestHitDetails [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.297 sec [junit] Running org.apache.nutch.searcher.TestOpenSearchServlet [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.328 sec [junit] Running org.apache.nutch.searcher.TestQuery [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.758 sec [junit] Running org.apache.nutch.searcher.TestSummarizerFactory [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.13 sec [junit] Running org.apache.nutch.searcher.TestSummary [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.446 sec [junit] Running org.apache.nutch.util.TestEncodingDetector [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.667 sec [junit] Test org.apache.nutch.util.TestEncodingDetector FAILED Heh, another java5/java6 problem (Charset.isSupported(utf-32) is false for java 5, true for java 6). I have made another commit, hopefully everything will be OK this time. 
[junit] Running org.apache.nutch.util.TestFibonacciHeap [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.115 sec [junit] Running org.apache.nutch.util.TestGZIPUtils [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.293 sec [junit] Running org.apache.nutch.util.TestNodeWalker [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.84 sec [junit] Running org.apache.nutch.util.TestPrefixStringMatcher [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.364 sec [junit] Running org.apache.nutch.util.TestStringUtil [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.231 sec [junit] Running org.apache.nutch.util.TestSuffixStringMatcher [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.357 sec [junit] Running org.apache.nutch.util.TestURLUtil [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.39 sec [junit] Running org.apache.nutch.util.mime.TestMimeType [junit] Tests run: 17, Failures: 0, Errors: 0, Time elapsed: 0.246 sec [junit] Running org.apache.nutch.util.mime.TestMimeTypes [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 1.695 sec BUILD FAILED http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml :297: Tests failed! Total time: 4 minutes 12 seconds Publishing Javadoc Recording test results -- Doğacan Güney
Re: Scoring API issues (LONG)
On 9/18/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration config); public void prepareGeneratorConfig(Path crawlDb, Configuration config); public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[] segments, Configuration config); public void prepareUpdateConfig(Path crawlDb, Path[] segments, Configuration config); Should we really pass Path-s to methods? IMHO, opening a file and reading from it looks a bit cumbersome. I would suggest that the relevant job would read the file then pass the data (MapWritable) to the method. For example, prepareGeneratorConfig would look like this: public void prepareGeneratorConfig(MapWritable crawlDbMeta, Configuration config); What about the segment's metadata in prepareUpdateConfig? Following your idea, we would have to pass a Map<String segmentName, MapWritable metaData> ... Yeah, I think it looks good but I guess you disagree? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Host-level stats, ranking and recrawl
Hi, On 9/17/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi, I was recently reading again some scoring-related papers, and found some interesting data in a paper by Baeza-Yates et al, Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering (http://citeseer.ist.psu.edu/730674.html). This paper compares various strategies for prioritizing a crawl of unfetched pages. Among others, it compared the OPIC scoring and a simple strategy which is called large sites first. This strategy prioritizes pages from large sites and deprioritizes pages from small / medium sites. In order to measure the effectiveness the authors used the value of accumulated PageRank vs. the percentage of crawled pages - the strategy that ensures quick ramp-up of aggregate pagerank is the best. A bit surprisingly, they found that large-sites-first wins over OPIC: Breadth-first is close to the best strategies for the first 20-30% of pages, but after that it becomes less efficient. The strategies batch-pagerank, larger-sites-first and OPIC have better performance than the other strategies, with an advantage towards larger-sites-first when the desired coverage is high. These strategies can retrieve about half of the Pagerank value of their domains downloading only around 20-30% of the pages. Nutch currently uses OPIC-like scoring for this, so most likely it suffers from the same symptoms (the authors also mention a relatively poor OPIC performance at the beginning of a crawl). Nutch doesn't collect at the moment any host-level statistics, so we couldn't use the other strategy even if we wanted. What if we added a host-level DB to Nutch? Arguments against this: it's an additional data structure to maintain, and this adds complexity to the system; it's an additional step in the workflow (- it takes longer time to complete one cycle of crawling). Arguments for are the following: we could implement the above scoring method ;), plus the host-level statistics are good for detecting spam sites, limiting the crawl by site size, etc. Another +1. We definitely need domain-level statistics anyway, so being able to implement large-sites-first is a nice bonus, I think :) We could start by implementing a tool to collect such statistics from CrawlDb - this should be a trivial map-reduce job, so if anyone wants to take a crack at this it would be a good exercise ... ;) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
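As a starting point for that exercise, a rough sketch of the map side of such a job: read (url, CrawlDatum) pairs from the crawldb and emit (host, 1), then sum per host with something like the stock LongSumReducer. The class name is invented and this is not an actual Nutch tool.

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Maps every crawldb entry to its host name with a count of 1.
public class HostCountMapper extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  public void map(Text url, CrawlDatum datum,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    try {
      output.collect(new Text(new URL(url.toString()).getHost()), ONE);
    } catch (MalformedURLException e) {
      // skip URLs we cannot parse
    }
  }
}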
Re: Scoring API issues (LONG)
segment. Each operation that changes a db or a segment would update this information. In practial terms, I propose to add static methods to CrawlDbReader, LinkDbReader and SegmentReader, which can retrieve and / or update this information. 3. Initialization of scoring plugins with global information Current scoring API works only with local properties of the page (I'm not taking into account plugins that use external information sources - that's outside of the scope of the API). It doesn't have any built-in facilities to collect and calculate global properties useful for PR or HITS calculation, such as e.g. the number of dangling nodes (ie. pages without outlinks), their total score, the number of inlinks, etc. It doesn't have the facility to output this collected global information at the end of the job. Neither has it any facility to initialize scoring plugins with such information if one exists. I propose to add the following methods to scoring plugins, so that they can modify the job configuration right before the job is started, so that later on the plugins could use this information when scoring filters are initialized in each task. E.g: public void prepareInjectorConfig(Path crawlDb, Path urls, Configuration config); public void prepareGeneratorConfig(Path crawlDb, Configuration config); public void prepareIndexerConfig(Path crawlDb, Path linkDb, Path[] segments, Configuration config); public void prepareUpdateConfig(Path crawlDb, Path[] segments, Configuration config); Should we really pass Path-s to methods? IMHO, opening a file and reading from it looks a bit cumbersome. I would suggest that the relevant job would read the file then pass the data (MapWritable) to the method. For example, prepareGeneratorConfig would look like this: public void prepareGeneratorConfig(MapWritable crawlDbMeta, Configuration config); Example: to properly implement the OPIC scoring, it's necessary to collect the total number of dangling nodes, and the total score from these nodes. Then, in the next step it's necessary to spread this total score evenly among all other nodes in the crawldb. Currently this is not possible unless we run additional jobs, and create additional files to keep this data around between the steps. It would be more convenient to keep this data in CrawlDb metadata (see above) and make relevant values available in the job context (Configuration). -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
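As a small illustration of how such a prepare hook could hand global totals to the per-task scoring filters, here is a sketch that round-trips dangling-node statistics through the job Configuration. The helper class and property names are invented for this example and are not part of Nutch:

import org.apache.hadoop.conf.Configuration;

// Sketch only: property names and this helper do not exist in Nutch.
public class OpicGlobals {

  private static final String DANGLING_COUNT = "scoring.opic.dangling.count";
  private static final String DANGLING_SCORE = "scoring.opic.dangling.score";

  // would run in a prepare*Config() hook, before the job is submitted
  public static void store(Configuration conf, long danglingCount, float danglingScore) {
    conf.set(DANGLING_COUNT, Long.toString(danglingCount));
    conf.set(DANGLING_SCORE, Float.toString(danglingScore));
  }

  // would run when the scoring filter is initialized inside each task:
  // spread the accumulated dangling score evenly over the remaining nodes
  public static float sharePerNode(Configuration conf, long totalNodes) {
    long dangling = Long.parseLong(conf.get(DANGLING_COUNT, "0"));
    float score = Float.parseFloat(conf.get(DANGLING_SCORE, "0"));
    long rest = totalNodes - dangling;
    return rest > 0 ? score / rest : 0.0f;
  }
}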
Re: Build failed in Hudson: Nutch-Nightly #203
]^ [javac] 3 errors BUILD FAILED Hmm, I can compile nutch successfully with Java 6 but not with Java 5. Is there an override annotation change between java 5 and java 6? http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build.xml :112: The following error occurred while executing this line: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/build.xml :76: The following error occurred while executing this line: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/build-plugin.xml :111: Compile failed; see the compiler error output for details. Total time: 47 seconds Publishing Javadoc Recording test results No test report files were found. Configuration error? Updating NUTCH-550 Updating NUTCH-546 -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #203
On 9/11/07, Susam Pal [EMAIL PROTECTED] wrote: Is it that the interface 'org.apache.nutch.net.URLFilter' was compiled with JDK 1.5 earlier? I have seen this problem happening with a beta version of JDK 1.6. No, it still happens with an ant clean; ant. The problem seems to be that Java 5 errs on override annotations for implemented methods while java 6 is OK with them. Both are ok with override for extended methods. Are you using the latest version, JDK 1.6 Update 2? $ java -version java version 1.6.0_02 Java(TM) SE Runtime Environment (build 1.6.0_02-b05) Java HotSpot(TM) Client VM (build 1.6.0_02-b05, mixed mode, sharing) Anyways, I am going to commit a small fix that removes override annotations so that code can be compiled. Regards, Susam Pal http://susam.in/ On 9/11/07, Doğacan Güney [EMAIL PROTECTED] wrote: On 9/11/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/203/changes Changes: [dogacan] NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. [dogacan] NUTCH-546 - file URL are filtered out by the crawler. -- [...truncated 4410 lines...] [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/classes [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/test init-plugin: deps-jar: compile: [echo] Compiling plugin: tld [javac] Compiling 2 source files to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/classes jar: [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/tld/tld.jar deps-test: deploy: [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld copy-generated-lib: [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/tld Overriding previous definition of reference to plugin.deps [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/test/data [copy] Copying 6 files to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/test/data init: [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/classes Overriding previous definition of reference to plugin.deps init-plugin: deps-jar: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter jar: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter compile-test: [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/lib-regex-filter/test [javac] Note: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. 
compile: [echo] Compiling plugin: urlfilter-automaton [javac] Compiling 1 source file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/classes jar: [jar] Building jar: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/urlfilter-automaton/urlfilter-automaton.jar deps-test: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: lib-regex-filter jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton Overriding previous definition of reference to plugin.deps copy-generated-lib: [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton [copy] Copying 1 file to http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk/build/plugins/urlfilter-automaton init: [mkdir] Created dir: http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/ws/trunk
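The Java 5 vs. Java 6 difference discussed above is easy to reproduce in isolation. The following class compiles with javac 1.6 but javac 1.5 rejects the annotation, because Java 5 only allows @Override on methods that override a superclass method, not on methods that implement an interface method:

// javac 1.6: compiles; javac 1.5: "method does not override a method
// from its superclass" on the @Override line.
public class OverrideDemo implements Runnable {
  @Override
  public void run() { }
}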
Re: Limiting outlink tags.
Hi Marcin, On 9/7/07, Marcin Okraszewski [EMAIL PROTECTED] wrote: Hi, I have noticed that Nutch considers img/@src as an outlink. I suppose in many cases people do not want to treat an image as an outlink. At least I don't want to. The same is the case with script/@src. But, it seems there is no way to limit outlink tags. The DOMContentUtils.getOutlinks() takes links from all of a,area,form,frame,iframe,script,link,img. Only the form element can be turned off, by the parser.html.form.use_action parameter. I would suggest introducing a new configuration parameter which could be used to turn certain elements on or off. It could be done simply with a single parameter containing a comma-separated list of tags to be turned off. What is your opinion? If you think it is a valid issue I can make a patch for this. There is already NUTCH-488 open for this (with a patch). Feel free to add comments/patches/etc. there. Btw, I agree that using a CSV is better than using a new configuration parameter for every tag. Regards, Marcin -- Doğacan Güney
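To make the CSV idea concrete, here is a rough sketch of how such a parameter could be read and consulted. The property name is hypothetical, and this is not the patch attached to NUTCH-488:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

// Sketch only; the property name below is made up for illustration.
public class OutlinkTagFilter {

  private Set<String> ignored = new HashSet<String>();

  public OutlinkTagFilter(Configuration conf) {
    // e.g. parser.html.outlinks.ignore_tags = img,script,link
    String csv = conf.get("parser.html.outlinks.ignore_tags", "");
    for (String tag : csv.split(",")) {
      tag = tag.trim().toLowerCase();
      if (tag.length() > 0) ignored.add(tag);
    }
  }

  // DOMContentUtils.getOutlinks() could check this before collecting the
  // src/href attribute of a given element
  public boolean accept(String tagName) {
    return !ignored.contains(tagName.toLowerCase());
  }
}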
Re: bug with generate performance
Hi, On 8/31/07, misc [EMAIL PROTECTED] wrote: Hello- I am almost certain I have found a nasty bug with nutch generate. Problem: Nutch generate can take many hours, even a day to complete (on a crawldb that has fewer than 2 million urls). I added debug code to Generator-Selector.map to see when map is called and returns, and observed interesting behavior, described here: 1. Most of the time, when generate is run urls are processed in chunky batches, usually about 40 at a time, followed by a 1 second delay. I timed the delay, and it really is a 1 second delay (i.e. 30 batches took 30 seconds). When this happens it takes hours to complete. 2. Sometimes (randomly as far as I can tell) when I run nutch, the urls are processed without delays. It is an all or nothing event, either I run and all urls process quickly without delay (in minutes), or more likely I get the chunky processing with many 1 second delays and the program takes hours to end. The one exception is 3. When the processing runs quickly I've seen the main thread end (I have some profiling going, so I know when a thread ends), and then more likely than not a second thread begins where the first starts, chunky like usual. Although I sometimes can get fast processing in one thread, it is almost impossible for me to get it in all threads and therefore general processing is very slow (hours). 4. I tried to put in more debug code to find the line where the delays occurred, but the last line printed to the log at a delay seemed random, leading me to believe that the log is not being flushed uniformly. 5. The profiler I used seemed to imply that about 100% of the time was spent in java.lang.Thread.sleep. I am not completely familiar with the profiler I used so I am not completely sure I interpreted this correctly. I will keep debugging here, but perhaps someone here has some insight into what might be happening? Others have also reported a problem with generate performance. It seems we have a problem here but I can not reproduce this behaviour so I am not sure what causes it. Can you open a JIRA issue and enter your comments there? Also, how you are running generate will be very helpful (what is generate.max.per.host? what is the -topN argument? etc.) thanks -J -- Doğacan Güney
Re: ant test failures
On 8/31/07, Christopher Bader [EMAIL PROTECTED] wrote: Hi, I'm new to this list. I checked out the whole nutch source tree yesterday, then I ran ant and ant test in the trunk and the three branches. Ant succeeded in all four cases, but ant test succeeded only on the 0.7 branch. In other words, ant test failed on the trunk and on the 0.8 and 0.9 branches. One of the plugins fails with Java 1.6 (I think it is parse-swf, but I am not sure). This is a known bug. Tests should pass successfully with 1.5. Is this the expected result? Or am I doing something wrong? I'm running Java 1.6 and Ant 1.7. CB -- Doğacan Güney
Re: Redirects and alias handling (LONG)
. - This issue has been briefly discussed in NUTCH-353. Inlink information should be merged so that all link information from all aliases is aggregated, so that it points to a selected canonical target URL. We should also merge their score. If example.com (with score 4.0) is an alias for www.example.com (with score 8.0), the selected url (which I think, as I said before, should be www.example.com) should end up with the score 12.0. We may not want to do this for aliases in different domains but I think we should definitely do this if two urls with the same content are under the same domain (like example.com). See also above sample queries from Google. B. Design and implementation In order to select the correct canonical URL at each stage in redirection handling we should keep the accumulated redirection path, which includes source URLs and redirection methods (temporary/permanent, protocol or content-level redirect, redirect delay). This way, when we arrive a the final page in the redirection path, we should be able to select the canonical path. We should also specify which intermediate URL we accept as the current canonical URL in case we haven't yet reached the end of redirections (e.g. when we don't follow redirects immediately, but only record them to be used in the next cycle). We should introduce an alias status in CrawlDb and LinkDb, which indicates that a given URL is a non-canonical alias of another URL. In CrawlDb, we should copy all accumulated metadata and put it into the target canonical CrawlDatum. In LinkDb, we should merge all inlinks pointing to non-canonical URLs so that they are assigned to the canonical URL. In both cases we should still keep the non-canonical URLs in CrawlDb and LinkDb - however we could decide not to keep any of the metadata / inlinks there, just an alias flag and a pointer to the canonical URL where all aggregated data is stored. CrawlDb and LinkDbReader may or may not hide this fact from their users - I think it would be more efficient if users of this API would get the final aggregated data right away, perhaps with an indicator that it was obtained using a non-canonical URL ... Regarding Lucene indexes - we could either duplicate all data for each non-canonical URL, i.e. create as many full-blown Lucene documents as many there are aliases, or we could create special redirect documents that would point to a URL which contains the full data ... We can avoid doing both. Let's assume A redirects to B, C also redirects to B and B redirects to D. After the fetch/parse/updatedb cycle that processes D we would probably have enough data to choose the 'canonical url' (let's assume that canonical is B). Then during Indexer's reduce we can just index parse text and parse data (and whatever else) of D under url B since we won't index B (or A or C) as itself (it doesn't have any useful content after all). That's it for now ... Any comments or suggestions to the above are welcome! Andrzej, have you written any code? I would suggest that we open a JIRA and have some code (no matter how much half-baked it is) as soon as we can. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
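As a small illustration of the score-merging rule described above (example.com at 4.0 folded into www.example.com at 8.0, giving 12.0 on the canonical entry), here is a hedged sketch; the "_canonical_" marker key and the helper class are invented, and none of this is existing Nutch code:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: fold an alias entry into its canonical CrawlDatum and
// leave the alias behind as a pointer to the canonical url.
public class AliasMerger {

  public static void merge(CrawlDatum canonical, CrawlDatum alias, String canonicalUrl) {
    // example.com (4.0) + www.example.com (8.0) -> 12.0 on the canonical entry
    canonical.setScore(canonical.getScore() + alias.getScore());

    // the alias keeps no data of its own, only a flag-like pointer
    alias.setScore(0.0f);
    alias.getMetaData().put(new Text("_canonical_"), new Text(canonicalUrl));
  }
}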
Re: Redirects and alias handling (LONG)
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: If the same content is available under multiple urls, I think it makes sense to assume that the url with the highest score should be 'the representative' url. Not necessarily - it depends how you defined your score. http://www.ibm.com/ may actually have a low score, because it immediately redirects to http://www.ibm.com/index.html (actually, it redirects to http://www.ibm.com/us/index.html). Also, the shortest url wins rule is not always true. Let's say I own a domain a.biz, and I made a Wikipedia mirror there. Which of the pages is more representative: http://a.biz/About_Wikipedia or http://www.wikipedia.org/en/About_Wikipedia ? 3. Link and anchor information for aliases and redirects. - This issue has been briefly discussed in NUTCH-353. Inlink information should be merged so that all link information from all aliases is aggregated, so that it points to a selected canonical target URL. We should also merge their score. If example.com (with score 4.0) is an alias for www.example.com (with score 8.0), the selected url (which I think, as I said before, should be www.example.com) should end up with the score 12.0. We may not want to do this for aliases in different domains but I think we should definitely do this if two urls with the same content are under the same domain (like example.com). I think you are right - at least with the OPIC scoring it would work ok. Regarding Lucene indexes - we could either duplicate all data for each non-canonical URL, i.e. create as many full-blown Lucene documents as many there are aliases, or we could create special redirect documents that would point to a URL which contains the full data ... We can avoid doing both. Let's assume A redirects to B, C also redirects to B and B redirects to D. After the fetch/parse/updatedb cycle that processes D we would probably have enough data to choose the 'canonical url' (let's assume that canonical is B). Then during Indexer's reduce we can just index parse text and parse data (and whatever else) of D under url B since we won't index B (or A or C) as itself (it doesn't have any useful content after all). Hmm. The index should somehow contain _all_ urls, which point to the same document. I.e. when you search for url http://example.com; it should ideally return exactly the same Lucene document as when you search for http://www.example.com/index.html;. Why would you do a search with the full name of the url? I also don't understand why we need to have all urls in index (we already eliminate near-duplicates with dedup). I guess I am missing your use case here... Similarly, the inlink information for all aliased urls should be the same (but in our case it's not a Lucene issue, only the LinkDb aliasing issue). I agree with you here. That's it for now ... Any comments or suggestions to the above are welcome! Andrzej, have you written any code? I would suggest that we open a JIRA and have some code (no matter how much half-baked it is) as soon as we can. Not yet - I'll open the issue and put these initial thoughts there. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Redirects and alias handling (LONG)
On 8/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hmm. The index should somehow contain _all_ urls, which point to the same document. I.e. when you search for url http://example.com; it should ideally return exactly the same Lucene document as when you search for http://www.example.com/index.html;. Why would you do a search with the full name of the url? I also don't understand why we need to have all urls in index (we already eliminate near-duplicates with dedup). I guess I am missing your use case here... Let's say I'm searching for test and I want to limit the search to a particular url. I enter a query: test url:example.com It should yield the same results as for the following query: test url:www.example.com (assuming they are aliases). I guess we can do something like this (continuing from my example above): Index D's data under B then add a alias field to the lucene document with A, C and D in it. Then change query-url so that a url: query also searches the alias field. Another, more realistic example: I'm searching for IBM products. So I enter a query: products site:ibm.com This should yield the same results as any of the following: products site:www.ibm.com products site:www-128.ibm.com products site:www-304.ibm.com Thanks for the explanation. How do we know that www.ibm.com and www-128.ibm.com hosts are perfect mirrors of one another? All we can know is that http://www.ibm.com/ and http://www-128.ibm.com/ *urls* are aliases of one another and that for the urls that we have fetched *so far* they seem to mirror each other. It is possible that the next URL we fetch from one of those sites does not exist in the other. I don't think that we can ever be certain that they are perfect mirrors of each other so, IMHO, we shouldn't treat those queries as same. Google also doesn't return the same results for products site:www.ibm.com products site:www-128.ibm.com . (One small unrelated note: As discussed in NUTCH-439 and NUTCH-445, we should treat site:ibm.com as all hosts under domain ibm.com even if http://www.ibm.com/ and http://ibm.com/ are perfect mirrors of each other.) -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
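A minimal sketch of the "alias field" idea in Lucene 2.x terms: index the content once under the canonical url B, and add A, C and D as extra untokenized alias values so that url:/site: style queries can be made to match any of them. Field names and the helper are illustrative, not how Nutch's Indexer actually builds documents:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch only: one document per canonical url, with every known alias
// stored as a separate value of an "alias" field.
public class AliasDoc {

  public static Document build(String canonicalUrl, String content, String[] aliases) {
    Document doc = new Document();
    doc.add(new Field("url", canonicalUrl, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", content, Field.Store.NO, Field.Index.TOKENIZED));
    for (int i = 0; i < aliases.length; i++) {
      doc.add(new Field("alias", aliases[i], Field.Store.NO, Field.Index.UN_TOKENIZED));
    }
    return doc;
  }
}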
Re: Is there any chance that my patches will be considered?
On 8/8/07, Marcin Okraszewski [EMAIL PROTECTED] wrote: Hello Nutch Developers, On May 22 I contributed two patches - NUTCH-487 and NUTCH-490. The next release is probably coming soon. I would be really pleased if they were merged by then, so that I don't have to keep patching the code myself. Is there any chance they will be merged into the source code? I can port them to the current head, but so far nobody has asked for this. NUTCH-487 is mostly a duplicate of NUTCH-369. I merged patches from both issues in NUTCH-25 since NUTCH-25 needs those patches to work correctly (and I gave you and Renaud Richardet credit in comments). As for NUTCH-490, I haven't taken an in-depth look at it, but I don't see the point of it. Why not just use HtmlParseFilters since you have access to the DOM object? What advantage do neko filters have? Also, having an extension point for a library possibly used by a possibly used plugin looks really, really wrong from a design point of view. I would also like to draw your attention to one point. It has already been two and a half months since I added the patches. There is not even a single comment on them. This is really discouraging for me, as a contributor. I know that merging patches is not the thing that developers love to do, but you are the only ones who can do it. Of course I don't mean you should give thanks for every contribution, but please take it into account. Having someone's work ignored, and that is how it looks to me, really discourages further work. Reviewing it and saying you won't merge it for some reason would be much better than leaving it without a single comment. This may reduce your active community. Think about it. Best regards, Marcin Okraszewski -- Doğacan Güney
Re: [jira] Commented: (NUTCH-527) MapWritable doesn't support all hadoops writable types
On 7/25/07, Robert Young [EMAIL PROTECTED] wrote: The message which was appearing in the logs is pasted below. Basically, in org.apache.nutch.crawl.MapWritable#getKeyValueEntry the Writable is instantiated. Its class is determined by a two byte code (which is written to the crawldb, I guess); if there is no entry for the class it fails to create it, regardless of whether it's a Writable. You're right in that it can potentially handle any Writable object, but only if it has a mapping for its class. If you add a writable that does not have a mapping, MapWritable automatically creates it and then stores the mapping internally. When a MapWritable is written, any new mapping is also written (as a byte and the corresponding class name). So, when you read a MapWritable it first reads all the mappings (that are not already statically defined) then proceeds to reading the Writable,Writable map. So I think your problem is caused by something else (perhaps there is a bug in MapWritable's implementation, but that is what that code is trying to do). Also, replying to JIRA generated emails does not add comments to issues (despite what the email says). So please use JIRA to reply. Cheers Rob 07/07/25 11:52:00 WARN crawl.MapWritable: Unable to load meta data entry, ignoring.. : java.io.IOException: unable to load class for id: 36 On 7/25/07, Doğacan Güney (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/NUTCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515283 ] Doğacan Güney commented on NUTCH-527: - What was the error you were having? MapWritable supports reading and writing *all* writables. The ones defined at the top of the file are an optimization and shouldn't affect correctness (basically MapWritable first writes a byte and the associated classname, then writes that byte to indicate the classname everywhere else. For commonly used types we statically define the association so that the "first write the byte then the classname" phase is not necessary). MapWritable doesn't support all hadoops writable types -- Key: NUTCH-527 URL: https://issues.apache.org/jira/browse/NUTCH-527 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Environment: Tested on Solaris and Windows with Java 1.5 Reporter: Rob Young Attachments: mapwritable.patch The map of classes which implement org.apache.hadoop.io.Writable is not complete. It does not, for example, include org.apache.hadoop.io.BooleanWritable. I would happily provide a patch if someone would explain what the Byte parameter is. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan Güney
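A simplified sketch of the bookkeeping described above; this is not the actual org.apache.nutch.crawl.MapWritable code, just the id-to-class idea in isolation:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the mechanism described above, not the real code:
// dynamically added Writable classes get a fresh byte id, and the
// (id, class name) pairs are written out before the map entries so the
// reader can rebuild the table before deserializing keys and values.
public class ClassIdTable {

  private Map<Class, Byte> classToId = new HashMap<Class, Byte>();
  private Map<Byte, Class> idToClass = new HashMap<Byte, Class>();
  private byte nextId = 1;

  public byte idFor(Class clazz) {
    Byte id = classToId.get(clazz);
    if (id == null) {                     // first time this class is seen
      id = new Byte(nextId++);
      classToId.put(clazz, id);
      idToClass.put(id, clazz);
    }
    return id.byteValue();
  }

  // writer side: emit every dynamically added (id, class name) pair
  public void write(DataOutput out) throws IOException {
    out.writeByte(classToId.size());
    for (Map.Entry<Class, Byte> e : classToId.entrySet()) {
      out.writeByte(e.getValue().byteValue());
      out.writeUTF(e.getKey().getName());
    }
  }

  // reader side: restore the table; a class name that cannot be loaded
  // here is what surfaces as an "unable to load class for id: NN" error
  public void read(DataInput in) throws IOException, ClassNotFoundException {
    int n = in.readByte();
    for (int i = 0; i < n; i++) {
      Byte id = new Byte(in.readByte());
      Class clazz = Class.forName(in.readUTF());
      classToId.put(clazz, id);
      idToClass.put(id, clazz);
    }
  }
}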
Re: OOM error during parsing with nekohtml
Hi, On 7/17/07, Shailendra Mudgal [EMAIL PROTECTED] wrote: Hi all, Thanks for your suggestions. I am running parse on a single url ( http://www.fotofinity.com/cgi-bin/homepages.cgi). For other urls, parse works perfectly. we are getting this error because of the html of the page. The page contains many anchor tags which are not closed properly. Hence neko html parser throws this exception. The page can be parsed successfully using tagsoup. We think this as a bug in neko html parser. Since tagsoup works and neko doesn't, I agree with you that this is a bug with neko. If you want to skip over this page (parser will not extract text from this page but parsing will successfully run overall), you may try changing catch clause in ParseSegment. java:77 from Exception to Throwable. This should catch OOM and continue. Regards, Shailendra On 7/16/07, Tsengtan A Shuy [EMAIL PROTECTED] wrote: Thank you for the info. The OOM exception in your previous email indicates that your system is running out of heap memory. You either have instantiated too many objects, or there are memory leaks in the source codes. Hope this will help you! Cheer!! Adam Shuy, President ePacific Web Design Hosting Professional Web/Software developer TEL: 408-272-6946 www.epacificweb.com -Original Message- From: Kai_testing Middleton [mailto:[EMAIL PROTECTED] Sent: Monday, July 16, 2007 8:43 AM To: nutch-dev@lucene.apache.org Subject: Re: OOM error during parsing with nekohtml You could try looking at these two discussions: http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg06571.html --Kai - Original Message From: Tsengtan A Shuy [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED] Sent: Monday, July 16, 2007 3:45:59 AM Subject: RE: OOM error during parsing with nekohtml I successfully run the whole-web crawl with the my new ubuntu OS, and I am ready to fix the bug. I need someone to guide me to get the most updated source code and the bug assignment. Thank you in advance!! Adam Shuy, President ePacific Web Design Hosting Professional Web/Software developer TEL: 408-272-6946 www.epacificweb.com -Original Message- From: Shailendra Mudgal [mailto:[EMAIL PROTECTED] Sent: Monday, July 16, 2007 3:05 AM To: [EMAIL PROTECTED]; nutch-dev@lucene.apache.org Subject: OOM error during parsing with nekohtml Hi All, We are getting an OOM Exception during the processing of http://www.fotofinity.com/cgi-bin/homepages.cgi . We have also applied Nutch-497 patch to our source code. But actually the error is coming during the parse method. Does anybody has any idea regarding this. 
Here is the complete stacktrace : java.lang.OutOfMemoryError: Java heap space at java.lang.String.toUpperCase(String.java:2637) at java.lang.String.toUpperCase(String.java:2660) at org.cyberneko.html.filters.NamespaceBinder.bindNamespaces( NamespaceBinder.ja va:443) at org.cyberneko.html.filters.NamespaceBinder.startElement( NamespaceBinder.java :252) at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java :100 9) at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:639) at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:646) at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement( HTMLScanner.j ava:2343) at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820) at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478) at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431) at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java :16 4) at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:265) at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:229) at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:168) at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:84) at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445) Regards, Shailendra Boardwalk for $500? In 2007? Ha! Play Monopoly Here and Now (it's updated for today's economy) at Yahoo! Games. http://get.games.yahoo.com/proddesc?gamekey=monopolyherenow -- Doğacan Güney
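Roughly what the suggested catch-clause change amounts to; the wrapper below is invented and is not the actual ParseSegment code. The essential point is catching Throwable instead of Exception, since Exception does not cover OutOfMemoryError:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.StringUtils;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;

// Sketch only: log and skip a page whose parse blows up, instead of
// letting the error propagate and kill the map task.
public class SafeParse {

  public static boolean tryParse(Configuration conf, Content content) {
    try {
      new ParseUtil(conf).parse(content);   // result handling omitted
      return true;
    } catch (Throwable t) {                 // was effectively: catch (Exception e)
      System.err.println("Skipping " + content.getUrl() + ": "
          + StringUtils.stringifyException(t));
      return false;
    }
  }
}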
Re: Nutch nightly build and NUTCH-505 draft patch
Hi, On 7/2/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: Recently I successully applied applied NUTCH-505_draft_v2.patch as follows: $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch $ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch --no-check-certificate $ sudo patch -p0 NUTCH-505_draft_v2.patch $ ant clean $ ant However, I also needed other recent nutch functionality, so I downloaded a nightly build: $ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz I then attempted to apply the patch to that build using the successive steps. I was able to run ant clean but ant failed with build.xml:61: Specify at least one source--a file or resource collection Do I need to get a source checkout of a nightly build? How would I do that? Once you checkout nutch trunk with svn checkout, you can use svn up to get the latest code changes. You can also use svn st -u which compares your local version against trunk and shows you what changed. Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/ -- Doğacan Güney
Re: OPIC scoring differences
On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Carl Cerecke wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Artiboul et al's paper. What exactly is different? How does the difference affect the scores? As it is now, the implementation doesn't preserve the total cash value in the system, and also there is almost no smoothing between the iterations (Abiteboul's history). As a consequence, scores may (and do) vary dramatically between iterations, and they don't converge to stable values, i.e. they always increase. For pages that get a lot of score contributions from other pages this leads to an explosive increase into the range of thousands or eventually millions. This means that the scores produced by the OPIC plugin exaggerate score differences between pages more and more, even if the web graph that you crawl is stable. In a sense, to follow the cash analogy, our implementation of OPIC illustrates a runaway economy - galloping inflation, rich get richer and poor get poorer ;) Also, there's a comment in the code: // XXX (ab) no adjustment? I think this is contrary to the algorithm descr. // XXX in the paper, where page loses its score if it's distributed to // XXX linked pages... Is this something that will be looked at eventually or is the scoring good enough at the moment without some adjustment. Yes, I'll start working on it when I get back from vacations. I did some simulations that show how to fix it (see http://wiki.apache.org/nutch/FixingOpicScoring bottom of the page). Andrzej, nice to see you working on this. There is one thing that I don't understand about your presentation. Assume that page A is the only url in our crawldb and it contains n outlinks. t = 0 - Generate runs, A is generated. t = 1 - Page A is fetched and its cash is distributed to its outlinks. t = 2 - Generate runs, pages P0-Pn are generated. t = 3 - P0 - Pn are fetched and their cash are distributed to their outlinks. - At this time, it is possible that page Pk links to page A. So, now Page A's cash 0. t = 4 - Generate runs, page A is considered but is not generated (since its next fetch time is later than current time). - Won't page A become a temporary sink? Time between subsequent fetches may be as large as 30 days in default configuration. So, page A will accumulate cash for a long time without distributing it. - I don't see how we can achieve that, but, IMO, if a page is considered but not generated, nutch should distribute its cash to outlinks the outlinks that are stored in its parse data. (I know that this is incredibly hard (if not impossible) to do this.) Or am I missing something here? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
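To make the temporary-sink concern above concrete, here is a toy model of the cash-conserving step; nothing in it is Nutch code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only. Total cash is conserved: on fetch, a page's cash is
// split over its outlinks and its own cash drops to zero. A page that
// is never generated never runs this step, so cash keeps piling up on
// it until the next time it is fetched.
public class OpicToy {

  Map<String, Float> cash = new HashMap<String, Float>();
  Map<String, List<String>> outlinks = new HashMap<String, List<String>>();

  void fetch(String page) {
    float c = cash.get(page);
    List<String> links = outlinks.get(page);
    if (links != null && !links.isEmpty()) {
      float share = c / links.size();
      for (String out : links) {
        Float old = cash.get(out);
        cash.put(out, (old == null ? 0.0f : old) + share);
      }
    }
    cash.put(page, 0.0f);
  }
}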
Re: OPIC scoring differences
Hi, On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al's paper. What exactly is different? How does the difference affect the scores? Also, there's a comment in the code: // XXX (ab) no adjustment? I think this is contrary to the algorithm descr. // XXX in the paper, where page loses its score if it's distributed to // XXX linked pages... Is this something that will be looked at eventually or is the scoring good enough at the moment without some adjustment. I certainly hope that this is something that will be looked at eventually. IMHO, scoring is not good enough, but it doesn't bother anyone enough that they decide to fix it. Also, see Andrzej's comments in NUTCH-267 about why the scoring-opic plugin is not really OPIC. It is basically a glorified link counter. Cheers, Carl. -- Doğacan Güney
Re: NUTCH-119 :: how hard to fix
On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: wow, setting db.max.outlinks.per.page immediately fixed my problem. It looks like I totally mis-diagnosed things. May I pose two questions: 1) how did you view all the outlinks? bin/nutch plugin parse-html org.apache.nutch.parse.html.HtmlParser local_file 2) how severe is NUTCH-119 - does it occur on a lot of sites? AFAIK, HtmlParser doesn't extract urls with regexps. Nutch uses a regexp to extract outlinks from files that have no markup information (such as plain text). See OutlinkExtractor.java. - Original Message From: Doğacan Güney [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Tuesday, June 26, 2007 10:56:32 PM Subject: Re: NUTCH-119 :: how hard to fix On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: I am evaluating nutch+lucene as a crawl and search solution. However, I am finding major bugs in nutch right off the bat. In particular, NUTCH-119: nutch is not crawling relative URLs. I have some discussion of it here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html Most of the links off www.variety.com, one of my main test sites, have relative URLs. It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs. It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene. Has anyone tried fixing this problem? Is it intractable? Or are the developers, who are just volunteers anyway, more interested in fixing other problems? Could someone outline the issue for me a bit more clearly so I would know how to evaluate it? Both this one and the other site you were mentioning (sf911truth) have more than 100 outlinks. Nutch, by default, only stores 100 outlinks per page (db.max.outlinks.per.page). Link about.html happens to be 105th link or so, so nutch doesn't store it. All you have to do is either increase db.max.outlinks.per.page or set it to -1 (which means, store all outlinks). Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center. http://autos.yahoo.com/green_center/ -- Doğacan Güney Be a better Heartthrob. Get better relationship answers from someone who knows. Yahoo! Answers - Check it out. http://answers.yahoo.com/dir/?link=listsid=396545433 -- Doğacan Güney
Re: [jira] Commented: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly
On 6/28/07, Hudson (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508747 ] Hudson commented on NUTCH-474: -- Integrated in Nutch-Nightly #131 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/131/]) *sigh* I wrote NUTCH-474 instead of NUTCH-434 in svn log. Sorry everyone... Fetcher2 sets server-delay and blocking checks incorrectly -- Key: NUTCH-474 URL: https://issues.apache.org/jira/browse/NUTCH-474 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Doğacan Güney Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: fetcher2.patch 1) Fetcher2 sets server delay incorrectly. It sets the delay to minCrawlDelay if maxThreads == 1 and to crawlDelay otherwise. Correct behaviour should be the opposite. 2) Fetcher2 sets wrong configuration options so host blocking is still handled by the lib-http plugin (Fetcher2 is designed to handle blocking internally). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. -- Doğacan Güney
JIRA email question
Hi list, There is this sentence at the end of every JIRA message: You can reply to this email to add a comment to the issue online. But, replying to a JIRA message through nutch-dev doesn't add it as a comment. So you have to either reply to an email through JIRA (in which case, it looks like you are responding to an imaginary person:) or through email (in which case, part of the discussion doesn't get documented in JIRA). Why doesn't this work? -- Doğacan Güney
Re: NUTCH-119 :: how hard to fix
On 6/27/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: I am evaluating nutch+lucene as a crawl and search solution. However, I am finding major bugs in nutch right off the bat. In particular, NUTCH-119: nutch is not crawling relative URLs. I have some discussion of it here: http://www.mail-archive.com/[EMAIL PROTECTED]/msg08644.html Most of the links off www.variety.com, one of my main test sites, have relative URLs. It seems incredible that nutch, which is capable of mapreduce, cannot fetch these URLs. It could be that I would fix this bug if, for other reasons, I decide to go with nutch+lucene. Has anyone tried fixing this problem? Is it intractable? Or are the developers, who are just volunteers anyway, more interested in fixing other problems? Could someone outline the issue for me a bit more clearly so I would know how to evaluate it? Both this one and the other site you were mentioning (sf911truth) have more than 100 outlinks. Nutch, by default, only stores 100 outlinks per page (db.max.outlinks.per.page). Link about.html happens to be 105th link or so, so nutch doesn't store it. All you have to do is either increase db.max.outlinks.per.page or set it to -1 (which means, store all outlinks). Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center. http://autos.yahoo.com/green_center/ -- Doğacan Güney
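For reference, the override mentioned above would normally go into conf/nutch-site.xml; shown here with -1 (store all outlinks), though any positive value simply raises the cap:

<!-- conf/nutch-site.xml -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>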
Re: Found the bug in Generator when number of URLs is small
On 6/21/07, Vishal Shah [EMAIL PROTECTED] wrote: Hi, I think I found the reason why the generator returns with an empty fetchlist for small fetchsizes. After the first job finishes running, the generator checks the following condition to see if it got an empty list: if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) { The third condition is incorrect here. In some cases, esp. for small fetchlists, the first partition might be empty, but some other partition(s) might contain urls. In this case, the Generator is incorrectly assuming that all partitions are empty by just looking at the first. This problem could also occur when all URLs in the fetchlist are from the same host (or from a very small number of hosts, or from a number of hosts that all map to a small number of partitions). I fixed this problem by replacing the following code: // check that we selected at least some entries ... SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir); if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) { LOG.warn("Generator: 0 records selected for fetching, exiting ..."); LockUtil.removeLockFile(fs, lock); fs.delete(tempDir); return null; } With the following code: // check that we selected at least some entries ... SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(job, tempDir); boolean empty = true; if (readers != null && readers.length > 0) { for (int num = 0; num < readers.length; num++) { if (readers[num].next(new FloatWritable())) { empty = false; break; } } } if (empty) { LOG.warn("Generator: 0 records selected for fetching, exiting ..."); LockUtil.removeLockFile(fs, lock); fs.delete(tempDir); return null; } This seems to do the trick. Nice catch. Can you open a JIRA issue and attach a patch there? Regards, -vishal. -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #123
Publishing Javadoc Recording test results This is rather strange. Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote: This is rather strange. Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec (ant test forks processes to test code, that's why we are seeing test outputs out of order.) Anyway, it is not TestSWFParser but TestFeedParser that fails. I am trying to understand why it fails. Chris, can you lend me a hand here? -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07, Chris Mattmann [EMAIL PROTECTED] wrote: Doğacan, This is strange indeed. I noticed this during my testing of parse-feed, however, thought it was an anomaly. I got this same strange cryptic unit test error message, and then after some frustration figuring it out, I did ant clean, then ant compile-core test, and miraculously the error seemed to go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test (of course after running ant compile-core from the top-level $NUTCH dir), the unit tests seem to pass? [XXX:src/plugin/feed] mattmann% pwd /Users/mattmann/src/nutch/src/plugin/feed [XXX:src/plugin/feed] mattmann% ant clean test Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml clean: [delete] Deleting directory /Users/mattmann/src/nutch/build/feed [delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data init-plugin: deps-jar: compile: [echo] Compiling plugin: feed [javac] Compiling 2 source files to /Users/mattmann/src/nutch/build/feed/classes compile-test: [javac] Compiling 1 source file to /Users/mattmann/src/nutch/build/feed/test jar: [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar deps-test: init: init-plugin: compile: jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: protocol-file jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec BUILD SUCCESSFUL Total time: 3 seconds [XXX:src/plugin/feed] mattmann% Any ideas? It never passes for me (not even when I do it in src/plugin/feed). If you check the output, parseResult only contains a single entry which is rsstest.rss. I think what causes this bug is (surprise, surprise) PrefixURLFilter. We don't have a template for prefix-urlfilter.txt in conf, so it doesn't get properly initialized and (I can't figure out why but) randomly filters out stuff. When I put a sample prefix-urlfilter.txt(*) under conf, all tests seem to pass. (*) As your friendly neighborhood Nutch developer, I even put up a sample file at: http://www.ceng.metu.edu.tr/~e1345172/prefix-urlfilter.txt Cheers, Chris On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote: This is rather strange. 
Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec (ant test forks processes to test code, that's why we are seeing test outputs out of order.) Anyway, it is not TestSWFParser but TestFeedParser that fails. I am trying to understand why it fails. Chris, can you lend me a hand here? -- Doğacan Güney __ Chris A. Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07, Dennis Kubes [EMAIL PROTECTED] wrote: Is this the same java 6 error that was popping up a while back? For some reason with java 6 the XML is being parsed differently in the SWF parser and therefore unit tests looking for exact strings were failing. Could this be happening in the feed parser as well? I ran into some other issues with Java 6 (backward compatibility, right...), so I actually switched my Java back to 5, at least for this computer. Dennis Kubes Chris Mattmann wrote: Doğacan, This is strange indeed. I noticed this during my testing of parse-feed, however, thought it was an anomaly. I got this same strange cryptic unit test error message, and then after some frustration figuring it out, I did ant clean, then ant compile-core test, and miraculously the error seemed to go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test (of course after running ant compile-core from the top-level $NUTCH dir), the unit tests seem to pass? [XXX:src/plugin/feed] mattmann% pwd /Users/mattmann/src/nutch/src/plugin/feed [XXX:src/plugin/feed] mattmann% ant clean test Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml clean: [delete] Deleting directory /Users/mattmann/src/nutch/build/feed [delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data init-plugin: deps-jar: compile: [echo] Compiling plugin: feed [javac] Compiling 2 source files to /Users/mattmann/src/nutch/build/feed/classes compile-test: [javac] Compiling 1 source file to /Users/mattmann/src/nutch/build/feed/test jar: [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar deps-test: init: init-plugin: compile: jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: protocol-file jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec BUILD SUCCESSFUL Total time: 3 seconds [XXX:src/plugin/feed] mattmann% Any ideas? Cheers, Chris On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote: This is rather strange. 
Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec (ant test forks processes to test code, that's why we are seeing test outputs out of order.) Anyway, it is not TestSWFParser but TestFeedParser that fails. I am trying to understand why it fails. Chris, can you lend me a hand here? -- Doğacan Güney __ Chris A. Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -- Doğacan Güney
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07, Chris Mattmann [EMAIL PROTECTED] wrote: On 6/20/07 7:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote: It never passes for me (not even when I do it in src/plugin/feed). If you check the output, parseResult only contains a single entry which is rsstest.rss. Okay, please tell me I'm not crazy here. I'm on Mac OS X 10.4, Java version: # java -version java version 1.5.0_07 Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_07-164) Java HotSpot(TM) Client VM (build 1.5.0_07-87, mixed mode, sharing) I did a fresh checkout of the Nutch trunk. Then, from that dir, I run: # ant compile-core # cd src/plugin/feed # ant clean test All tests pass? Here is a log: [XXX:~/src/nutch] mattmann% ant compile-core Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/build.xml init: [mkdir] Created dir: /Users/mattmann/src/nutch/build [mkdir] Created dir: /Users/mattmann/src/nutch/build/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/test/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/hadoop [unjar] Expanding: /Users/mattmann/src/nutch/lib/hadoop-0.12.2-core.jar into /Users/mattmann/src/nutch/build/hadoop [untar] Expanding: /Users/mattmann/src/nutch/build/hadoop/bin.tgz into /Users/mattmann/src/nutch/bin [mkdir] Created dir: /Users/mattmann/src/nutch/build/webapps [unjar] Expanding: /Users/mattmann/src/nutch/lib/hadoop-0.12.2-core.jar into /Users/mattmann/src/nutch/build compile-core: [javac] Compiling 172 source files to /Users/mattmann/src/nutch/build/classes [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. BUILD SUCCESSFUL Total time: 3 seconds [XXX:~/src/nutch] mattmann% cd src/plugin/feed [XXX:src/plugin/feed] mattmann% ant clean test Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data clean: [delete] Deleting directory /Users/mattmann/src/nutch/build/feed init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data init-plugin: deps-jar: compile: [echo] Compiling plugin: feed [javac] Compiling 2 source files to /Users/mattmann/src/nutch/build/feed/classes compile-test: [javac] Compiling 1 source file to /Users/mattmann/src/nutch/build/feed/test jar: [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar deps-test: init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/nutch-extensionpoints [mkdir] Created dir: /Users/mattmann/src/nutch/build/nutch-extensionpoints/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/nutch-extensionpoints/test init-plugin: compile: jar: [jar] Building MANIFEST-only jar: /Users/mattmann/src/nutch/build/nutch-extensionpoints/nutch-extensionpoints. 
jar deps-test: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/nutch-extensionpoints init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/protocol-file [mkdir] Created dir: /Users/mattmann/src/nutch/build/protocol-file/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/protocol-file/test init-plugin: deps-jar: compile: [echo] Compiling plugin: protocol-file [javac] Compiling 4 source files to /Users/mattmann/src/nutch/build/protocol-file/classes jar: [jar] Building jar: /Users/mattmann/src/nutch/build/protocol-file/protocol-file.jar deps-test: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/protocol-file [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/protocol-file copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/protocol-file deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser
upgrade to hadoop-0.13?
Hi all, As you know, hadoop-0.13 was recently released and it brings some impressive improvements over the hadoop-0.12.x series. So the obvious question is: should we switch to hadoop-0.13? I have tested Nutch on hadoop-0.13 with all the basic jobs (inject, generate, fetch, parse, updatedb, invertlinks, index, dedup) and they work fine. -- Doğacan Güney
Re: Welcome Doğacan as Nutch committer
Hi all, Thank you everyone! It has been very exciting so far and I believe that it is only going to get better from here on :) Let me introduce myself (very) briefly: I am based in Ankara, Turkey. I am 22 and currently working on my graduate degree. I hope that together we will make Nutch rock even harder. -- Doğacan Güney
Re: [Fwd: Nutch 0.9 and Crawl-Delay]
Hi, On 6/4/07, Doug Cutting [EMAIL PROTECTED] wrote: Does the 0.9 crawl-delay implementation actually permit multiple threads to access a site simultaneously? AFAIK, yes. Option fetcher.threads.per.host should be greater than 1 _only_ when you are accessing a site under your control. So, all of nutch's politeness policies are pretty much ignored when fetcher.threads.per.host is greater than 1. Fetcher2 completely ignores nutch's server-delay and the site's crawl-delay value if maxThreads > 1, and uses a separate min.crawl.delay value when accessing the site. I am not sure about Fetcher, but I think it is going to allow up to maxThreads fetcher threads to access the site simultaneously and then block the next one. There may be a better explanation in this post to nutch-dev: Fetcher2's delay between successive requests. (A rough sketch of this delay selection appears after the quoted log below.) Doug Original Message Subject: Nutch 0.9 and Crawl-Delay Date: Sun, 3 Jun 2007 10:50:24 +0200 From: Lutz Zetzsche [EMAIL PROTECTED] Reply-To: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Dear Nutch developers, I have had problems with a Nutch-based robot during the last 12 hours, which I have now solved by banning this particular bot from my server (not Nutch completely for the moment). The ilial bot, which created considerable load on my server, was using the latest Nutch version - v0.9 - which is now also supporting the crawl-delay directive in the robots.txt. The bot seems to have obeyed the directive - crawl-delay: 10 - as it visited my website every 15 seconds, which would have been ok, BUT it then submitted FIVE requests at once (see example log extract below)! 5 requests at once every 15 seconds is not acceptable on my server, which is principally serving dynamic content and is often visited by up to 10 search engines at the same time, altogether surely creating 99.9% of the server traffic. So my suggestion is that Nutch submits only one request at a time when it detects a crawl-delay directive in the robots.txt. This is the behaviour the MSNbot shows, for example. The MSNbot also liked to submit several requests at once every few seconds, until I added the crawl-delay directive to my robots.txt. Best wishes Lutz Zetzsche http://www.sea-rescue.de/ 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Photos+%26+Videos/PV/ HTTP/1.0 200 13661 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED]) 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Links/WRGL/Countries/ HTTP/1.0 200 15048 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED]) 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0 200 60041 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company.
For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED]) 66.249.72.244 - - [03/Jun/2007:04:40:55 +0200] GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1 200 17568 - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 66.231.189.119 - - [03/Jun/2007:04:40:55 +0200] GET /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/ HTTP/1.0 200 17193 - Gigabot/2.0 (http://www.gigablast.com/spider.html) 74.6.86.105 - - [03/Jun/2007:04:40:56 +0200] GET /dansk/Links/Hermann+Apelt/ HTTP/1.0 200 30496 - Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp) 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0 200 16658 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED]) 72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0 200 15624 - ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; [EMAIL PROTECTED]) -- Doğacan Güney
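To make the delay discussion above concrete, here is a rough Java sketch of how a polite fetcher can pick the wait time for a host. It is not the actual Fetcher/Fetcher2 code; the parameters mirroring fetcher.server.delay, fetcher.threads.per.host and a min.crawl.delay-style floor are assumptions based on this thread.

// Hypothetical sketch of per-host delay selection; not the actual Nutch code.
public class DelayPolicy {

  /**
   * @param robotsCrawlDelaySecs Crawl-Delay from robots.txt, or -1 if absent
   * @param maxThreadsPerHost    assumed fetcher.threads.per.host value
   * @param serverDelayMs        assumed fetcher.server.delay value in milliseconds
   * @param minCrawlDelayMs      minimal floor used when politeness is bypassed
   */
  static long chooseDelayMs(long robotsCrawlDelaySecs, int maxThreadsPerHost,
                            long serverDelayMs, long minCrawlDelayMs) {
    if (maxThreadsPerHost > 1) {
      // Several threads may hit the host at once, so robots.txt politeness
      // is effectively bypassed and only a minimal floor applies.
      return minCrawlDelayMs;
    }
    if (robotsCrawlDelaySecs >= 0) {
      // Honour the site's Crawl-Delay directive.
      return robotsCrawlDelaySecs * 1000L;
    }
    return serverDelayMs;
  }

  public static void main(String[] args) {
    // crawl-delay: 10 with one thread per host -> 10000 ms between requests
    System.out.println(chooseDelayMs(10, 1, 5000, 100));
    // same site with fetcher.threads.per.host = 5 -> politeness bypassed
    System.out.println(chooseDelayMs(10, 5, 5000, 100));
  }
}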
Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.
On 6/4/07, Sami Siren [EMAIL PROTECTED] wrote: Briggs wrote: Yeah, you are correct there. How does this thing actually even remotely begin to work on a predictable level? One crucial aspect of language identification is that the input is properly encoded. There was a patch that added icu4j character set encoding detection into Nutch. I believe icu4j also offers language identification in addition to character set detection. Has anyone checked how usable the language identification from icu4j would be? There are severe problems with the current language identification for CJK, for example. Can you give a few links? I have looked at icu4j's API, but I haven't found any info about language identification. IBM does have something called Linguini (http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp). It doesn't seem to be open source, though. -- Sami Siren -- Doğacan Güney
Re: [jira] Commented: (NUTCH-496) ConcurrentModificationException can be thrown when getSorted() is called.
On 6/5/07, Sami Siren [EMAIL PROTECTED] wrote: I just saw this in the API and assumed it had to do with detecting the language; I might be wrong: http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html#getLanguage() I think that method is used to get the detected charset's ISO language code. For example, it returns tr for ISO-8859-9. That being said, language identification is a very crucial feature and if it doesn't work properly, well, someone should do something about it :). -- Sami Siren -- Doğacan Güney
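For anyone who wants to check this, a minimal probe of icu4j's charset detection could look like the sketch below. The sample string and charset are placeholders; the point is that a CharsetMatch reports a charset name, a language guess and a confidence score.

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class CharsetProbe {
  public static void main(String[] args) throws Exception {
    // Placeholder sample: Turkish text encoded as ISO-8859-9.
    byte[] raw = "Dünya çapında örümcek ağı".getBytes("ISO-8859-9");

    CharsetDetector detector = new CharsetDetector();
    detector.setText(raw);
    CharsetMatch match = detector.detect();

    // getName() is the best-guess charset; getLanguage() is the ISO code of the
    // language whose byte statistics matched best (may be null for some charsets).
    System.out.println("charset:    " + match.getName());
    System.out.println("language:   " + match.getLanguage());
    System.out.println("confidence: " + match.getConfidence());
  }
}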
Re: Plugins and Thread Safety
Hi, On 6/4/07, Briggs [EMAIL PROTECTED] wrote: So, I synchronized it and it seems that the problem has not repeated itself. I think that was it. That's great. Can you open a JIRA issue and submit a patch for this? Thanks On 6/1/07, Briggs [EMAIL PROTECTED] wrote: I will get back to you. It isn't the easiest bug to test. So, will let you know soon! On 6/1/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Briggs wrote: Oh, you want me to change the getSorted method to be synchronized? I'll put a lock in there and see what happens, if that is what you are referring to. Yes, please try this change. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Conscious decisions by conscious minds are what make reality real -- Conscious decisions by conscious minds are what make reality real -- Doğacan Güney
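The fix discussed here boils down to guarding the shared structure while it is sorted and iterated. A minimal sketch of that idea, using a hypothetical registry rather than the actual class from NUTCH-496, might look like this:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical registry illustrating the synchronized getSorted() idea.
public class ExtensionRegistry {
  private final Map<String, List<String>> extensionsByType =
      new HashMap<String, List<String>>();

  public synchronized void add(String type, String extensionId) {
    List<String> list = extensionsByType.get(type);
    if (list == null) {
      list = new ArrayList<String>();
      extensionsByType.put(type, list);
    }
    list.add(extensionId);
  }

  // Synchronizing prevents one thread from sorting/iterating the list while
  // another thread is adding to it (the source of the
  // ConcurrentModificationException); returning a copy keeps callers safe.
  public synchronized List<String> getSorted(String type) {
    List<String> list = extensionsByType.get(type);
    List<String> copy = (list == null)
        ? new ArrayList<String>()
        : new ArrayList<String>(list);
    Collections.sort(copy);
    return copy;
  }
}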
Re: Plugins initialized all the time!
On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote: On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: My patch is just a draft to see if we can create a better caching mechanism. There are definitely some rough edges there :) One important piece of information: in future versions of Hadoop the method Configuration.setObject() is deprecated and will then be removed, so we have to grow our own caching mechanism anyway - either use a singleton cache, or change nearly all APIs to pass around a user/job/task context. So, we will face this problem pretty soon, with the next upgrade of Hadoop. Hmm, well, that sucks, but this is not really a problem for PluginRepository: PluginRepository already has its own cache mechanism. You are right about per-plugin parameters but I think it will be very difficult to keep the PluginProperty class in sync with plugin parameters. I mean, if a plugin defines a new parameter, we have to remember to update PluginProperty. Perhaps we can force plugins to define the configuration options they will use in, say, their plugin.xml files, but that will be very error-prone too. I don't want to compare entire configuration objects, because changing irrelevant options like fetcher.store.content shouldn't force loading plugins again, though it seems it may be inevitable. Let me see if I understand this ... In my opinion this is a non-issue. Child tasks are started in separate JVMs, so the only context information that they have is what they can read from job.xml (which is a superset of all properties from config files + job-specific data + task-specific data). This context is currently instantiated as a Configuration object, and we (ab)use it also as a local per-JVM cache for plugin instances and other objects. Once we instantiate the plugins, they exist unchanged throughout the lifecycle of the JVM (== lifecycle of a single task), so we don't have to worry about having different sets of plugins with different parameters for different jobs (or even tasks). In other words, it seems to me that there is no such situation in which we have to reload plugins within the same JVM, but with different parameters. The problem is that someone might get a little too smart. For example, one may write a new job that has two IndexingFilters but creates each from completely different configuration objects, then filters some documents with the first filter and others with the second. I agree that this is a bit of a reach, but it is possible. Actually, thinking a bit further into this, I kind of agree with you. I initially thought that the best approach would be to change PluginRepository.get(Configuration) to PluginRepository.get(), where get() just creates a configuration internally and initializes itself with it. But then we wouldn't be passing JobConf to PluginRepository; instead, PluginRepository would do something like a NutchConfiguration.create(), which is probably wrong. So, all in all, I've come to believe that my (and Nicolas') patch is a not-so-bad way of fixing this. It allows us to pass JobConf to PluginRepository and stops creating new PluginRepository-s again and again... What do you think? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney -- Doğacan Güney
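As a rough illustration of the caching approach converging in this thread - keying the cache on the handful of properties that actually affect plugin loading, rather than on the whole Configuration object - a sketch could look like the following. The class is hypothetical (not the actual patch), and the chosen property names (plugin.folders, plugin.includes, plugin.excludes, plugin.auto-activation) are an assumption about which options matter.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of a plugin repository cache keyed by plugin-affecting
// properties, so that distinct Configuration instances with the same plugin
// settings share one repository instead of re-instantiating every plugin.
public class CachedPluginRepository {
  private static final Map<String, CachedPluginRepository> CACHE =
      new HashMap<String, CachedPluginRepository>();

  private CachedPluginRepository(Configuration conf) {
    // ... discover plugin.folders, apply plugin.includes/excludes,
    //     instantiate extension points, etc. ...
  }

  private static String cacheKey(Configuration conf) {
    // Assumed set of plugin-affecting properties; the real patch may differ
    // (and may use a WeakHashMap or an LRU map instead of a plain HashMap).
    return conf.get("plugin.folders") + "|" + conf.get("plugin.includes") + "|"
        + conf.get("plugin.excludes") + "|" + conf.get("plugin.auto-activation");
  }

  public static synchronized CachedPluginRepository get(Configuration conf) {
    String key = cacheKey(conf);
    CachedPluginRepository repo = CACHE.get(key);
    if (repo == null) {
      repo = new CachedPluginRepository(conf);
      CACHE.put(key, repo);
    }
    return repo;
  }
}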
Re: Plugins initialized all the time!
Hi, On 5/29/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote: Which job causes the problem? Perhaps, we can find out what keeps creating a conf object over and over. Also, I have tried what you have suggested (better caching for plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem? (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch Some comments about your patch. The approach seems nice: you only check the parameters that affect plugin loading. But keep in mind that the plugins themselves will configure themselves with many other parameters, so to keep things safe there should be a PluginRepository for each set of parameters (including all of them). Besides, remember that CACHE is a WeakHashMap; you are creating ad-hoc PluginProperty objects as keys, and something doesn't look right... the lifespan of those objects will be much shorter than you require. Perhaps you should be using SoftReferences instead, or a simple LRU cache (LinkedHashMap provides that simply). My patch is just a draft to see if we can create a better caching mechanism. There are definitely some rough edges there :) I don't really worry about the WeakHashMap-LinkedHashMap stuff. But your approach is simple and should be faster, so I guess it's OK. You are right about per-plugin parameters but I think it will be very difficult to keep the PluginProperty class in sync with plugin parameters. I mean, if a plugin defines a new parameter, we have to remember to update PluginProperty. Perhaps we can force plugins to define the configuration options they will use in, say, their plugin.xml files, but that will be very error-prone too. I don't want to compare entire configuration objects, because changing irrelevant options like fetcher.store.content shouldn't force loading plugins again, though it seems it may be inevitable. Anyway, I'll try to build my own Nutch to test your patch. Thanks! -- Doğacan Güney
Re: Plugins initialized all the time!
On 5/30/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doğacan Güney wrote: My patch is just a draft to see if we can create a better caching mechanism. There are definitely some rough edges there :) One important piece of information: in future versions of Hadoop the method Configuration.setObject() is deprecated and will then be removed, so we have to grow our own caching mechanism anyway - either use a singleton cache, or change nearly all APIs to pass around a user/job/task context. So, we will face this problem pretty soon, with the next upgrade of Hadoop. Hmm, well, that sucks, but this is not really a problem for PluginRepository: PluginRepository already has its own cache mechanism. You are right about per-plugin parameters but I think it will be very difficult to keep the PluginProperty class in sync with plugin parameters. I mean, if a plugin defines a new parameter, we have to remember to update PluginProperty. Perhaps we can force plugins to define the configuration options they will use in, say, their plugin.xml files, but that will be very error-prone too. I don't want to compare entire configuration objects, because changing irrelevant options like fetcher.store.content shouldn't force loading plugins again, though it seems it may be inevitable. Let me see if I understand this ... In my opinion this is a non-issue. Child tasks are started in separate JVMs, so the only context information that they have is what they can read from job.xml (which is a superset of all properties from config files + job-specific data + task-specific data). This context is currently instantiated as a Configuration object, and we (ab)use it also as a local per-JVM cache for plugin instances and other objects. Once we instantiate the plugins, they exist unchanged throughout the lifecycle of the JVM (== lifecycle of a single task), so we don't have to worry about having different sets of plugins with different parameters for different jobs (or even tasks). In other words, it seems to me that there is no such situation in which we have to reload plugins within the same JVM, but with different parameters. The problem is that someone might get a little too smart. For example, one may write a new job that has two IndexingFilters but creates each from completely different configuration objects, then filters some documents with the first filter and others with the second. I agree that this is a bit of a reach, but it is possible. -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com -- Doğacan Güney
Re: Plugins initialized all the time!
Hi, On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote: I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems that the plugin repository initializes itself all the time, until I get an out of memory exception. I've been looking at the code... the plugin repository maintains a map from Configuration to plugin repositories, but the Configuration object does not have an equals or hashCode method... wouldn't it be nice to add such a method (comparing property values)? Wouldn't that help prevent initializing many plugin repositories? What could be the cause of my problem? (Aaah.. so many questions... =) ) Which job causes the problem? Perhaps, we can find out what keeps creating a conf object over and over. Also, I have tried what you have suggested (better caching for plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem? (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch Bye! -- Doğacan Güney
Re: Plugins initialized all the time!
On 5/29/07, Briggs [EMAIL PROTECTED] wrote: I have also noticed this. The code explicitly loads an instance of the plugins for every fetch (well, or parse etc., depending on what you are doing). This causes OutOfMemoryErrors. So, if you dump the heap, you can see the filter classes get loaded and they never get unloaded (they are loaded within their own classloader). So, you'll see the same class loaded thousands of times, which is bad. So, in my case, I had to change the way the plugins are loaded. Basically, I changed all the main plugin loaders (like URLFilters.java, IndexFilters.java) to be singletons with a single 'getInstance()' method on each. I don't need special configs for filters so I can deal with singletons. You'll find the heart of the problem somewhere in the extension point class(es). It calls newInstance() an awful lot. But the classloader (one per plugin) never gets destroyed, or something, so this can be nasty. I'm still dealing with my OutOfMemory errors on parsing, yuck. Well, then can you test the patch too? Nicolas's idea seems to be the right one. After this patch, I think plugin loaders will see the same PluginRepository instance. On 5/29/07, Doğacan Güney [EMAIL PROTECTED] wrote: Hi, On 5/28/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote: I'm having big troubles with nutch 0.9 that I didn't have with 0.8. It seems that the plugin repository initializes itself all the time, until I get an out of memory exception. I've been looking at the code... the plugin repository maintains a map from Configuration to plugin repositories, but the Configuration object does not have an equals or hashCode method... wouldn't it be nice to add such a method (comparing property values)? Wouldn't that help prevent initializing many plugin repositories? What could be the cause of my problem? (Aaah.. so many questions... =) ) Which job causes the problem? Perhaps, we can find out what keeps creating a conf object over and over. Also, I have tried what you have suggested (better caching for plugin repository) and it really seems to make a difference. Can you try with this patch(*) to see if it solves your problem? (*) http://www.ceng.metu.edu.tr/~e1345172/plugin_repository_cache.patch Bye! -- Doğacan Güney -- Conscious decisions by conscious minds are what make reality real -- Doğacan Güney
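For comparison, the singleton-loader workaround described above (one lazily created instance per loader such as URLFilters, at the cost of ignoring per-job configuration differences) might be sketched roughly like this; the class and method bodies are illustrative only, not the actual Nutch code:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

// Illustrative singleton wrapper in the spirit of the workaround described
// above; not the actual URLFilters/IndexingFilters implementation.
public class SingletonURLFilters {
  private static SingletonURLFilters instance;
  private final Configuration conf;

  private SingletonURLFilters(Configuration conf) {
    this.conf = conf;
    // ... load the URL filter extensions once, using conf ...
  }

  // Every caller shares one instance, so the plugin classloaders are created
  // only once per JVM. This is only safe when all jobs in the JVM can live
  // with the same filter configuration.
  public static synchronized SingletonURLFilters getInstance() {
    if (instance == null) {
      instance = new SingletonURLFilters(NutchConfiguration.create());
    }
    return instance;
  }
}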
Re: Bug (with fix): Neko HTML parser goes on defaults.
Hi, On 5/21/07, Marcin Okraszewski [EMAIL PROTECTED] wrote: Hi, The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this. I attach a patch which fixes this. It was done on Nutch 0.9, but the SVN trunk contains the same code. The patch does: 1. Fixes the augmentations feature. 2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html 3. Prints a warning message when an exception is caught. Please note that now a lot of messages go to the console (not the log4j log), because the report-errors feature is being set. Shouldn't it be removed? I would suggest that you open a JIRA issue and attach the patch there. For this case, there is a similar issue (with patch) at NUTCH-369. Cheers, Marcin -- Doğacan Güney
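The shape of the fix is to set each feature in its own guarded call and log a warning instead of silently swallowing the failure, roughly like the sketch below. The feature URIs are the ones documented on the NekoHTML settings page; the helper class itself is illustrative, not the patched HtmlParser code.

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.cyberneko.html.parsers.DOMFragmentParser;

// Sketch of a non-silent setup: each feature is set individually and a
// failure is logged, so one unsupported feature no longer skips the rest.
public class NekoSetup {
  private static final Log LOG = LogFactory.getLog(NekoSetup.class);

  public static DOMFragmentParser newParser() {
    DOMFragmentParser parser = new DOMFragmentParser();
    setFeatureQuietly(parser, "http://cyberneko.org/html/features/augmentations", true);
    setFeatureQuietly(parser, "http://cyberneko.org/html/features/report-errors", false);
    return parser;
  }

  private static void setFeatureQuietly(DOMFragmentParser parser,
                                        String feature, boolean value) {
    try {
      parser.setFeature(feature, value);
    } catch (Exception e) {
      LOG.warn("Could not set NekoHTML feature " + feature + ": " + e);
    }
  }
}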
Re: retrieving original html from database
On 4/25/07, Charlie Williams [EMAIL PROTECTED] wrote: I have an index of pages from the web, a bit over 1 million. The fetch took several weeks to complete, since it was mainly over a small set of domains. Once we had a completed fetch and index, we began trying to work with the retrieved text, and found that the cached text is just that, flat text. Is the original HTML cached anywhere that it can be accessed after the initial fetch? It would be a shame to have to recrawl all those pages. We are using Nutch 0.8. If you have fetcher.store.content set to true then Nutch has stored a copy of all the pages in segment_dir/content. You can extract individual contents with the command ./nutch readseg -get segment_dir url -noparse -nofetch -nogenerate -noparsetext -noparsedata. Thanks for any help. -Charlie -- Doğacan Güney
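Programmatically, the stored content can also be read straight out of a segment with the Hadoop file APIs. The sketch below is a rough, hypothetical dump tool: it assumes fetcher.store.content was true during the fetch and the default single part-00000 layout under the segment's content directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: dump the raw stored content (e.g. original HTML) from
// one part file of a segment's content/ directory.
public class DumpSegmentContent {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // args[0] is the segment directory; assumes the default part-00000 layout.
    Path data = new Path(args[0], "content/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      Content content = (Content) value;
      System.out.println("URL: " + key);
      System.out.println(new String(content.getContent())); // raw fetched bytes, often HTML
    }
    reader.close();
  }
}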