RE: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
Thanks Doğacan, much obliged.

Gal.

-----Original Message-----
From: Doğacan Güney (JIRA) [mailto:[EMAIL PROTECTED]]
Sent: Sunday, June 17, 2007 11:29 PM
To: nutch-dev@lucene.apache.org
Subject: [jira] Resolved: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-485.
---------------------------------
    Resolution: Fixed

Committed in rev 548103 with two modifications:

1) Fixed whitespace issues.
2) The original patch changed CCParseFilter to return the original parse result if CCParseFilter fails. Now, if CCParseFilter fails with an exception, it returns an empty parse created from the exception.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>        Assignee: Doğacan Güney
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
RE: Lock file problems...
I index directly to Solr. It happened to me while two separate indexers accessed it directly. It seemed that the Lucene index stayed hung (that's why the lock exists) until I killed the process. After that I had to re-build the index, since I was afraid it got corrupted.

-----Original Message-----
From: Briggs [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 07, 2007 6:21 PM
To: nutch-dev@lucene.apache.org
Subject: Lock file problems...

I am getting these lock file errors all over the place when indexing or even creating crawldbs. It doesn't happen all the time, but sometimes it happens continuously. So, I am not quite sure how these locks are getting in there, or why they aren't getting removed. I am not sure where to go from here.

My current application is designed for crawling individual domains. So, I have multiple custom crawlers that work concurrently. Each one basically does:

1) fetch
2) invert links
3) segment merge
4) index
5) deduplicate
6) merge indexes

Though, I am still not 100% sure of what the indexes directory is truly for.

java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:/crawloutput/http$~~www.camlawblog.com/indexes/part-0/write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:69)
        at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
        at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
        at org.apache.nutch.indexer.DeleteDuplicates.reduce(DeleteDuplicates.java:414)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)

So, has anyone seen this come up on their own implementations?
[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501914 ]

Gal Nitzan commented on NUTCH-485:
----------------------------------

Could one of the committers review this patch and maybe submit it, please? The patch touches a few locations, and with so many changes occurring right now it might be more complicated to fix it later...

Thanks

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
-----------------------------
    Attachment: NUTCH-485.200705130928.patch

Following Andrzej's advice, much cleaner code :) Attached...

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
-----------------------------
    Attachment: NUTCH-485.200705131241.patch

Thanks Doğacan, I missed it :( Thanks to all reviewers. Yet another patch...

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
-----------------------------
    Attachment: NUTCH-485.200705140001.patch

Thanks Doğacan for taking the time to review the code. I agree with your comments on the usage. I run a video search, and it is surely going to help. The ability to discover and add content on the fly to the segment while parsing is long-awaited functionality, and it is all made possible by NUTCH-443... :)

And yet one more update, with a better description in the javadoc and some fixes to indentation.

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch, NUTCH-485.200705130928.patch, NUTCH-485.200705130945.patch, NUTCH-485.200705131241.patch, NUTCH-485.200705140001.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
Site nightly API link is broken
Hi, The link http://lucene.apache.org/nutch/nutch-nightly/docs/api/index.html is broken.
RE: Site nightly API link is broken
Truly sorry, but I don't know where it should point to.

-----Original Message-----
From: Sami Siren [mailto:[EMAIL PROTECTED]]
Sent: Saturday, May 12, 2007 11:05 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Site nightly API link is broken

Gal Nitzan wrote:
> Hi,
> The link http://lucene.apache.org/nutch/nutch-nightly/docs/api/index.html is broken.

Can you submit a patch? (The xml files are under src/site.)

--
Sami Siren
[jira] Created: (NUTCH-484) Nutch Nightly API link is broken in site
Nutch Nightly API link is broken in site
----------------------------------------

            Key: NUTCH-484
            URL: https://issues.apache.org/jira/browse/NUTCH-484
        Project: Nutch
     Issue Type: Bug
     Components: documentation
Affects Versions: 1.0.0
    Environment: All
       Reporter: Gal Nitzan
       Priority: Trivial
        Fix For: 1.0.0

The Nightly API link is broken.
[jira] Created: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
Change HtmlParseFilter 's to return ParseResult object instead of Parse object
------------------------------------------------------------------------------

            Key: NUTCH-485
            URL: https://issues.apache.org/jira/browse/NUTCH-485
        Project: Nutch
     Issue Type: Improvement
     Components: fetcher
Affects Versions: 1.0.0
    Environment: All
       Reporter: Gal Nitzan
        Fix For: 1.0.0

The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
[jira] Updated: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gal Nitzan updated NUTCH-485:
-----------------------------
    Attachment: NUTCH-485.200705122151.patch

Attached patch for this issue. Comments are welcome. This patch touches a few plugins, please review.

Thanks,
Gal

> Change HtmlParseFilter 's to return ParseResult object instead of Parse object
> ------------------------------------------------------------------------------
>             Key: NUTCH-485
>             URL: https://issues.apache.org/jira/browse/NUTCH-485
>         Project: Nutch
>      Issue Type: Improvement
>      Components: fetcher
> Affects Versions: 1.0.0
>     Environment: All
>        Reporter: Gal Nitzan
>         Fix For: 1.0.0
>     Attachments: NUTCH-485.200705122151.patch
>
> The current implementation of HtmlParseFilters.java doesn't allow a filter to add parse objects to the ParseResult object. A change to HtmlParseFilter is needed which allows the filter to return a ParseResult, and of course a change to HtmlParseFilters.
Why not make SOLR the Nutch SE
Hi,

Since I ran into SOLR the other day, I have been wondering why we can't join forces between the two projects. The projects complement each other.

Any thoughts?

Gal.
RE: Injector checking for other than STATUS_INJECTED
Hi Andrzej,

Does it mean that when you inject a URL that already exists in the crawldb, its status changes to STATUS_DB_UNFETCHED?

Gal

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 15, 2007 8:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Injector checking for other than STATUS_INJECTED

[EMAIL PROTECTED] wrote:
> Hi All,
>
> I think I am missing something. In the Injector reduce code we have the following:
>
>     while (values.hasNext()) {
>       CrawlDatum val = (CrawlDatum)values.next();
>       if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
>         injected = val;
>         injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>       } else {
>         old = val;
>       }
>     }
>     CrawlDatum res = null;
>     if (old != null) res = old; // don't overwrite existing value
>     else res = injected;
>
> Basically, if it is not just injected then don't overwrite. But I am not seeing where the input could be such that the CrawlDatum wasn't just injected and could have previous values. Is this just in case someone uses the Injector as a Reducer and not a Mapper, or am I missing how this condition can occur?

This handles an important case, when you inject URLs that already exist in the DB - then you have both the old value and the newly created value under the same key. In previous versions of Injector, CrawlDatum-s for such URLs could be overwritten with new values, and you could lose valuable metadata accumulated in old values.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
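[Editor's note] The merge rule discussed in the thread above can be sketched in isolation. This is a hedged illustration, not Nutch source: "Datum" is a stand-in for org.apache.nutch.crawl.CrawlDatum, and only the status handling relevant to the thread is modeled.

```java
import java.util.List;

public class InjectMergeSketch {
    static final int STATUS_INJECTED = 0;
    static final int STATUS_DB_UNFETCHED = 1;

    // Minimal stand-in for CrawlDatum; real CrawlDatum carries score,
    // fetch time, metadata, etc.
    static class Datum {
        int status;
        Datum(int status) { this.status = status; }
    }

    // Mirrors the quoted reduce loop: a freshly injected value is re-marked
    // as unfetched, but a pre-existing value for the same URL always wins,
    // so metadata accumulated in the crawldb is never overwritten.
    static Datum merge(List<Datum> values) {
        Datum injected = null, old = null;
        for (Datum val : values) {
            if (val.status == STATUS_INJECTED) {
                injected = val;
                injected.status = STATUS_DB_UNFETCHED;
            } else {
                old = val;
            }
        }
        return old != null ? old : injected; // don't overwrite existing value
    }
}
```

This matches Andrzej's answer to Gal's question: an already-known URL keeps its old CrawlDatum, so injecting it again does not reset its status.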
RE: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
Thanks Dennis, it seems it did the trick. Not totally sure, but so it seems :)

Gal.

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 13, 2007 11:09 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually, I take it back. I don't think it is the same problem, but I do think it is the right solution.

Dennis Kubes

Dennis Kubes wrote:
> This has to do with HADOOP-964. Replace the jar files in your Nutch versions with the most recent versions from Hadoop. You will also need to apply the NUTCH-437 patch to get Nutch to work with the most recent changes to the Hadoop codebase.
>
> Dennis Kubes
>
> Gal Nitzan wrote:
>> Hi,
>>
>> Does anybody use Nutch trunk? I am running Nutch 0.9 and unable to fetch; after 50-60K urls I get an NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time. I was wondering if anyone has a workaround, or maybe something is wrong with my setup. I have opened a new issue in jira for this: http://issues.apache.org/jira/browse/hadoop-1008
>>
>> Any clue?
>>
>> Gal
RE: hadoop-site.xml - absolute Path
Hi Tobias,

The property should go in nutch-site.xml, and you can see a sample for it in nutch-default.xml.

HTH,
Gal

-----Original Message-----
From: Tobias Zahn [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 13, 2007 12:30 AM
To: nutch-dev@lucene.apache.org
Subject: hadoop-site.xml - absolute Path

Hello out there,

Sorry for mailing to this list another time. I'm not sure if I'm not working carefully enough or something, but I'm facing even more problems. I put a new property in conf/hadoop-site.xml, according to the examples in hadoop-default.xml. The new property contains the path to a configuration file for a plugin. With that entry, this occurs:

2007-02-12 22:38:00,246 FATAL api.RegexURLFilterBase - Can't find resource: $CORRECT-AND-EXISTING-PATH

Now I wonder if:
1) I can't extend api.RegexURLFilterBase and use another config file or something similar
2) I can't use an absolute path for my properties.

It would be great if anyone is interested in that plugin and would like to help me find my errors. Please contact me, I'll mail you the source (something around 100 lines). [The plugin will make it possible to index only some files, according to a regex file - similar to urlfilter-regex.]

Best regards,
Tobias Zahn
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471747 ]

Gal Nitzan commented on NUTCH-443:
----------------------------------

Actually, I tested Rome after feedparser failed with an OutOfMemory. Rome has the same problem as feedparser: both convert the feed to JDOM first :(. I had to write my own implementation of an rss parser with StAX. Neither Rome nor feedparser could handle a 100K-item feed, which (probably) isn't the common use case, but it is not that far-fetched a use case either.

HTH
Gal.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> ---------------------------------------------------------------------------------
>             Key: NUTCH-443
>             URL: https://issues.apache.org/jira/browse/NUTCH-443
>         Project: Nutch
>      Issue Type: New Feature
>      Components: fetcher
> Affects Versions: 0.9.0
>        Reporter: Renaud Richardet
>        Priority: Minor
>         Fix For: 0.9.0
>     Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> Allow Parser#parse to return a Map<String, Parse>. This way, the RSS parser can return multiple parse objects, which will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
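[Editor's note] The streaming approach Gal mentions can be sketched with the JDK's own StAX API (javax.xml.stream): pull-parse the feed one event at a time so a 100K-item feed is never materialized as a DOM/JDOM tree. This is illustrative only, not the parser Gal actually wrote; element names follow RSS 2.0.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class StaxRssSketch {
    // Collects the <link> of every <item> in the feed. Memory use stays
    // flat regardless of item count, because only the current event and
    // the accumulated links are held - no tree is built.
    static List<String> itemLinks(InputStream in) throws XMLStreamException {
        List<String> links = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        boolean inItem = false;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if ("item".equals(name)) {
                    inItem = true;
                } else if (inItem && "link".equals(name)) {
                    // getElementText() consumes up to the matching end tag.
                    links.add(r.getElementText().trim());
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT
                    && "item".equals(r.getLocalName())) {
                inItem = false;
            }
        }
        return links;
    }
}
```

The inItem flag is what distinguishes an item's <link> from the channel-level <link>, which is the usual pitfall when flattening RSS this way.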
NPE while fetching
Hi,

I experience an NPE while fetching. I use Nutch trunk (a week ago) with Hadoop 0.11.1:

java.lang.NullPointerException
        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2392)
        at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2087)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:498)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1372)

Any pointers to the cause?

Thanks, Gal.
Re: RSS-fecter and index individul-how can i realize this function
Hi,

IMO it should stay the same: URL as the key, and in the filter each item's link element becomes the key. I will be happy to convert the current parse-rss filter to the suggested implementation.

Gal.

-- Original Message --
Received: Tue, 06 Feb 2007 10:36:03 AM IST
From: Doğacan Güney [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi,

Doug Cutting wrote:
> Doğacan Güney wrote:
>> I think it would make much more sense to change parse plugins to take content and return Parse[] instead of Parse.
>
> You're right. That does make more sense.

OK, then should I go forward with this and implement something? This should be pretty easy, though I am not sure what to give as keys to a Parse[]. I mean, when getParse returned a single Parse, ParseSegment output them as <url, Parse>. But if getParse returns an array, what will be the key for each element? Something like <url#i, Parse[i]> may work, but this may cause problems in dedup (for example, assume we fetched the same rss feed twice and indexed them in different indexes: the two versions' url#0 may be different items, but since they have the same key, dedup will delete the older).

--
Doğacan Güney
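[Editor's note] The <url#i, Parse[i]> key scheme floated above can be shown concretely. A minimal sketch with plain strings, not Nutch API; the dedup caveat from the mail still applies, since the suffix encodes only the position of the item within one fetch, not which item it was.

```java
public class ParseKeySketch {
    // Derives one key per Parse produced from a single fetched URL,
    // following the "url#i" convention from the thread. If the same feed
    // is fetched twice with items in a different order, url#0 can denote
    // two different items - the collision Doğacan warns about.
    static String[] keysFor(String url, int parseCount) {
        String[] keys = new String[parseCount];
        for (int i = 0; i < parseCount; i++) {
            keys[i] = url + "#" + i;
        }
        return keys;
    }
}
```

A more collision-resistant variant (keying on the item's own link URL, as Gal suggests at the top of the mail) avoids that problem at the cost of keys that no longer share the feed's URL prefix.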
Generator.java bug?
Hi,

After many failures of generate ("Generator: 0 records selected for fetching, exiting ...") I made a post about it a few days back. I narrowed it down to the following function:

  public Path generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)

in the following if:

  if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable()))

It turns out that the !readers[0].next(new FloatWritable()) is the culprit.

Gal
RE: Generator.java bug?
PS. In the following code:

  if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
    LOG.warn("Generator: 0 records selected for fetching, exiting ...");
    LockUtil.removeLockFile(fs, lock);
    fs.delete(tempDir);
    return null;
  }

there is no need for the "if" here:

  if (readers != null)
    for (int i = 0; i < readers.length; i++)
      readers[i].close();

-----Original Message-----
From: Gal Nitzan [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 02, 2007 1:56 PM
To: nutch-dev@lucene.apache.org
Subject: Generator.java bug?

Hi,

After many failures of generate ("Generator: 0 records selected for fetching, exiting ...") I made a post about it a few days back. I narrowed it down to the following function:

  public Path generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)

in the following if:

  if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable()))

It turns out that the !readers[0].next(new FloatWritable()) is the culprit.

Gal
RE: Generator.java bug?
Hi Andrzej,

Well, on my system the list does contain urls and the fetcher does fetch it correctly; however, if I keep that test in the if, it reports that the list is empty. I am not sure, but maybe the first value is not a FloatWritable, or maybe something else?

Thanks,
Gal

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 02, 2007 3:28 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Generator.java bug?

Gal Nitzan wrote:
> Hi,
>
> After many failures of generate ("Generator: 0 records selected for fetching, exiting ...") I made a post about it a few days back. I narrowed it down to the following function:
>
>   public Path generate(Path dbDir, Path segments, int numLists, long topN, long curTime, boolean filter, boolean force)
>
> in the following if:
>
>   if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable()))
>
> It turns out that the !readers[0].next(new FloatWritable()) is the culprit.

Well, this condition simply checks if the result is not empty. When we open Reader[] on a SequenceFile, each reader corresponds to a part-x. There must be at least one part, so we use the one at index 0. If we cannot retrieve at least one entry from it, then it logically follows that the file is empty, and we bail out.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
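[Editor's note] The emptiness test Andrzej explains can be mirrored with a plain-Java analogue; here a java.util.Iterator stands in for Hadoop's SequenceFile.Reader, so this is not Hadoop code, just the shape of the check.

```java
import java.util.Iterator;
import java.util.List;

public class EmptyCheckSketch {
    // Same shape as the quoted "if": no readers, zero parts, or a first
    // part that yields no record all mean "0 records selected".
    static boolean isEmpty(List<? extends Iterator<?>> readers) {
        return readers == null
                || readers.isEmpty()
                || !readers.get(0).hasNext();
    }
}
```

Note the built-in assumption: only part 0 is probed, so the check is correct only if at least one record always lands in the first part. If partitioning ever left part 0 empty while other parts held records, this test would misreport an empty result, which is consistent with the symptom Gal describes.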
RE: RSS-fecter and index individul-how can i realize this function
Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. Indeed, the links are extracted, and each item link will be fetched in the next fetch as a separate page, and will end up as one Lucene document.

IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource, and there is no reason to go to the server and re-fetch that resource.

Another issue that arises from rss feeds is that once the feed page is fetched, you can not re-fetch it until its time to fetch has expired. The feeds' TTL is usually very short. Since for now in Nutch all pages are created equal :) it is one more thing to think about.

HTH,
Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi Gal, et al.,

I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;)

Okay, so you mention below that the thing that you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and parse it in the next fetch phase. Well, there are 2 options here for what you refer to as "it":

1. If you're talking about the RSS file, then in fact, it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed.

2. If you're talking about the item links within the RSS file, in fact, they are parsed (eventually), and their data stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file, and make sure that we index their content as well.

Thus, if you had an RSS file R that contained links in it to a PDF file A, and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file) would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query.

Does this make sense? Is this the issue that you're talking about? Am I nuts? ;)

Cheers,
Chris

On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

> Hi,
>
> Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data and so forth.
>
> Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page in the site.
>
> IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would flag the URL as "parsable not fetchable"?
>
> Just my two cents...
>
> Gal.
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, January 31, 2007 8:44 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this function
>
> Hi there,
>
> With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is to allow you to associate the metadata fields category: and author: with the item Outlink...
>
> Cheers,
> Chris
>
> On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:
>
>> thx for ur reply. mybe i didn't tell clearly. I want to index the item as an individual page. then when i search for something, for example "nutch-open source", nutch returns a hit which contains:
>>
>>   title: nutch-open source
>>   description: nutch nutch nutch nutch nutch
>>   url: http://lucene.apache.org/nutch
>>   category: news
>>   author: kauu
>>
>> so, can the plugin parse-rss satisfy what i need?
>>
>>   <item>
>>     <title>nutch--open source</title>
>>     <description>nutch nutch nutch nutch nutch</description>
>>     <link>http://lucene.apache.org/nutch</link>
>>     <category>news</category>
>>     <author>kauu</author>
>>   </item>
>>
>> On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:
>>> Hi there,
>>>
>>> I could most likely be of assistance, if you gave me some more information.
RE: RSS-fecter and index individul-how can i realize this function
Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data and so forth.

Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page in the site.

IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would flag the URL as "parsable not fetchable"?

Just my two cents...

Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is to allow you to associate the metadata fields category: and author: with the item Outlink...

Cheers,
Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

> thx for ur reply. mybe i didn't tell clearly. I want to index the item as an individual page. then when i search for something, for example "nutch-open source", nutch returns a hit which contains:
>
>   title: nutch-open source
>   description: nutch nutch nutch nutch nutch
>   url: http://lucene.apache.org/nutch
>   category: news
>   author: kauu
>
> so, can the plugin parse-rss satisfy what i need?
>
>   <item>
>     <title>nutch--open source</title>
>     <description>nutch nutch nutch nutch nutch</description>
>     <link>http://lucene.apache.org/nutch</link>
>     <category>news</category>
>     <author>kauu</author>
>   </item>
>
> On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:
>> Hi there,
>>
>> I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?
>>
>> Cheers,
>> Chris
>>
>> On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:
>>> Hi folks:
>>>
>>> What I want to do is to separate a rss file into several pages. Just as what has been discussed before, I want to fetch a rss page and index it as different documents in the index, so the searcher can search an item's info as an individual hit.
>>>
>>> My idea: create a protocol for fetching the rss page and storing it as several pages, each of which just contains one ITEM tag. But the unique key is the url, so how can I store them with the ITEM's link tag as the unique key for a document?
>>>
>>> So my question is how to realize this function in nutch-0.8.x. I've checked the code of the plug-in protocol-http, but I can't find the code where a page is stored as a document. I want to separate the rss page into several ones before storing it as a document. So can anyone give me some hints?
>>>
>>> Any reply will be appreciated!
>>>
>>> ITEM's structure:
>>>
>>>   <item>
>>>     <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
>>>     <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
>>>     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>>>     <category>搜狐焦点图新闻</category>
>>>     <author>[EMAIL PROTECTED]</author>
>>>     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>>>     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
>>>   </item>
>>>
>>> --
>>> www.babatu.com
Generator: 0 records selected for fetching, exiting
hi,

ENV: FC6, JVM 1.6, Nutch trunk with hadoop 0.10.1

When running generate I get the following msg:

Generator: 0 records selected for fetching, exiting

When using readdb there are unfetched urls:

Statistics for CrawlDb: vcrawldb
TOTAL urls:     1525
retry 0:        1525
min score:      0.0060
avg score:      0.166
max score:      1.195
status 1 (db_unfetched):        1338
status 2 (db_fetched):  127
status 3 (db_gone):     60
CrawlDb statistics: done

Any idea?
RE: parse-rss make them items as different pages
Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality, and I believe it's not a simple task.

What is basically required is to create a page in a segment for each item, and to add the item's URL to the crawldb. Since the data already exists in the item element, there is no reason to fetch the page (item). After that, the only thing left is to index it.

Any thoughts on how to achieve that goal?

Gal.

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss make them items as different pages

I want to crawl rss feeds and parse them, then index them, so that at search time the content of each item is returned just like an individual page. I don't know whether I'm telling you clearly.

  <item>
    <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
    <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>搜狐焦点图新闻</category>
    <author>[EMAIL PROTECTED]</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
  </item>

This is one item in an rss file. I want nutch to deal with an item like an individual page, so when I search for something in this item, nutch returns it as a hit.

So... can anyone tell me how to do this? Any reply will be appreciated.

--
www.babatu.com
record version mismatch occured
Trying to mergesegs I get the following, any idea?

A record version mismatch occured. Expecting v4, found v5
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:147)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1175)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1258)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:69)
        at org.apache.nutch.segment.SegmentMerger$ObjectInputFormat$1.next(SegmentMerger.java:139)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:201)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:213)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1211)
RE: record version mismatch occured
Thanks Sami. By redo, do you mean re-parse, or re-fetch + re-parse?

-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED]
Sent: Friday, January 26, 2007 10:49 PM
To: nutch-dev@lucene.apache.org
Subject: Re: record version mismatch occured

Gal Nitzan wrote:
> Got it. I used latest trunk for a few hours and it seems that it changed the version of CrawlDatum to ver. 5 :(

Earlier one left too early; one (or more) of your segments has data written with the newer version. If you haven't updated the crawldb then you just need to redo that (those) segment(s).

--
Sami Siren
java.io.FileNotFoundException: / (Is a directory)
Just installed latest from trunk. I run mergesegs and I get the following error in all task log files (I use the default log4j.properties):

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: / (Is a directory)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
        at org.apache.log4j.Logger.getLogger(Logger.java:104)
        at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
        at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
        at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
        at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:370)
        at org.apache.hadoop.mapred.TaskTracker.<clinit>(TaskTracker.java:59)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1346)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
RE: updating index without refetching
Hi,

You do not mention whether the new field's data is stored as metadata. Is the value created during the parse, or is it added only during the index phase?

If your new field is created during the parse process, then you could delete only the parse folders and re-run the parse, i.e. delete segment/crawl_parse, segment/parse_data and segment/parse_text, and then run:

bin/nutch parse <segment>

Or, if your field's data is added during the index process, then just re-create your index. In any case it doesn't seem to me you would need to re-fetch.

HTH
Gal

-Original Message-
From: DS jha [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 28, 2006 4:11 PM
To: nutch-dev@lucene.apache.org
Subject: updating index without refetching

Hi All,

Is it possible to update the index without refetching everything? I have changed the logic of one of my plugins (which also sets a custom field in the index) and I would like this field to get updated without refetching everything - is it doable?

Thanks,
RE: Error with Hadoop-0.4.0
To get the same behavior, just try to inject into a new crawldb that doesn't exist. The reason many don't see it is that the crawldb already exists in their environment.

-Original Message-
From: Sami Siren [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 06, 2006 7:23 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Error with Hadoop-0.4.0

Jérôme Charron wrote:
> Hi, I encountered some problems with the Nutch trunk version. In fact it seems to be related to changes related to Hadoop-0.4.0 and JDK 1.5 (more precisely since HADOOP-129 and the replacement of File by Path). Does somebody have the same error?

I am not seeing this (just ran inject on a single-machine (linux) configuration, local fs, without problems).

--
Sami Siren
RE: search speed
Hi,

DFS is too slow for searching. What we did was extract the index, linkdb and segments to the local FS, i.e. to the hard disk. Each machine has 2x300GB HD in RAID:

bin/hadoop dfs -get index /nutch/index
bin/hadoop dfs -get linkdb /nutch/linkdb
bin/hadoop dfs -get segments /nutch/segments

When we run out of disk space for the segments on one web server, we add another web server, use mergesegs to split the segments, and use the distributed search.

HTH

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 15, 2006 10:09 AM
To: nutch-dev@lucene.apache.org
Subject: search speed

I am using dfs. My index contains 3706249 documents. Presently, a search takes from 2 to 4 seconds (I tested with a query of 3 search terms). Tomcat is started on a box with dual Opteron 2.4 GHz CPUs and 16 GB RAM. I think search is very slow now. Can we make search faster? What factors influence search speed?
RE: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?
In my company we changed the default, and many others probably did the same. However, we must not ignore the behavior of the irresponsible users of Nutch, and for that reason use of the default must be blocked in code. Just my 2 cents.

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 15, 2006 9:30 AM
To: nutch-dev@lucene.apache.org
Subject: Re: IncrediBILL's Random Rants: How Much Nutch is TOO MUCH Nutch?

Doug Cutting wrote:
> http://incredibill.blogspot.com/2006/06/how-much-nutch-is-too-much-nutch.html

Well, I think incrediBILL has an argument: people might really start excluding bots from their servers if it becomes too much. What might help is if the site would offer an index of itself, which should be smaller than the site. I am not sure if there exists a standard for something like this. Basically the bot would ask the server if an index exists, where it is located and what date it is from, and then the bot decides to download the index or otherwise starts crawling the site.

Michi

--
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61
RE: NPE When using a merged segment
I think it is a bug. It saves the old segment name instead of replacing it with the new segment name.

-Original Message-
From: Dominik Friedrich [mailto:[EMAIL PROTECTED]
Sent: Monday, May 29, 2006 7:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

I have the same problem with a merged segment. I had a look with Luke at the index and it seems that the indexer puts the old segment names in there instead of the name of the merged segment. I'm not sure if I did something wrong or if this is a bug.

Dominik

Gal Nitzan schrieb:
> Hi, I have built a new index based on the new segment only.

-Original Message-
From: Stefan Neufeind [mailto:[EMAIL PROTECTED]
Sent: Monday, May 29, 2006 10:03 AM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

Gal Nitzan wrote:
> Hi, After using mergesegs to merge all my segments into one segment, I moved the new segment to segments. When accessing the web UI I get:
>
> java.lang.RuntimeException: java.lang.NullPointerException
>         org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>         org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>         org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)

Hi, I'm not sure - but have you tried reindexing that new segment? To my understanding the index holds references to the segment (segment name) - and in your case those are invalid. This would also explain the error you get (in the call to getSummary), because the summary is fetched from the segment. If this works, then maybe you'll need to find a better way of cleaning up the index - not reindexing everything but maybe just rewriting the segment names all into one or so. Feedback welcome.

Good luck,
Stefan
NPE When using a merged segment
Hi,

After using mergesegs to merge all my segments into one segment, I moved the new segment to segments. When accessing the web UI I get:

java.lang.RuntimeException: java.lang.NullPointerException
        org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
        org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
        org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)

Gal.
RE: Where exactly nutch scoring takes place ?
Hi,

The scoring in Nutch 0.8 is done in a plugin: scoring-opic. It is called from Indexer.java.

HTH

-Original Message-
From: ahmed ghouzia [mailto:[EMAIL PROTECTED]
Sent: Friday, May 26, 2006 3:16 PM
To: nutch-user@lucene.apache.org; nutch-dev@incubator.apache.org
Subject: Where exactly nutch scoring takes place ?

I want to use nutch as an environment to test my proposed algorithm for web mining.
1- Where exactly does the nutch scoring take place? In which packages or files?
2- Can the LinkAnalysisTool be run at the intranet level? Some documents mention that it can take place only at the whole-web crawling level.
3- What technologies and concepts must I be familiar with to get into nutch development? Is it only jsp and servlets, or anything else?

-
Be a chatter box. Enjoy free PC-to-PC calls with Yahoo! Messenger with Voice.
[jira] Commented: (NUTCH-284) NullPointerException during index
[ http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12413231 ] Gal Nitzan commented on NUTCH-284: -- I just had something similar. Try the following: run ant on each of your tasktracker machines (% ant), then restart your nutch and try again. I think there is a problem with the classpath.

NullPointerException during index - Key: NUTCH-284 URL: http://issues.apache.org/jira/browse/NUTCH-284 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Stefan Neufeind

For quite a few this reduce sort has been going on. Then it fails. What could be wrong with this?

060524 212613 reduce sort
060524 212614 reduce sort
060524 212615 reduce sort
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
060524 212619 Optimizing index.
060524 212619 job_jlbhhm java.lang.NullPointerException
        at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
        at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
Exception in thread main java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)

-- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
RE: error
A new plugin was added to the code base. You need to add a new entry, summary-basic or summary-lucene, to the plugin.includes property in tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml.

HTH

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 11:39 AM
To: nutch-dev@lucene.apache.org
Subject: error

I updated all plugins... And now I get errors in the tomcat log:

May 22, 2006 3:28:50 AM org.apache.nutch.plugin.PluginRepository init
SEVERE: org.apache.nutch.plugin.PluginRuntimeException: Plugin (summary-basic), extension point: org.apache.nutch.searcher.Summarizer does not exist.

How do I fix this problem?
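For reference, the property would look roughly like this in nutch-site.xml; the plugin list shown is only illustrative (keep whatever plugins your install already uses and append summary-basic or summary-lucene to it):

```xml
<!-- in tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic</value>
</property>
```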
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412435 ] Gal Nitzan commented on NUTCH-271: -- This functionality is already available in Nutch-0.8 Meta-data per URL/site/section -- Key: NUTCH-271 URL: http://issues.apache.org/jira/browse/NUTCH-271 Project: Nutch Type: New Feature Versions: 0.7.2 Reporter: Stefan Neufeind We have the need to index sites and attach additional meta-data-tags to them. Afaik this is not yet possible, or is there a workaround I don't see? What I think of is using meta-tags per start-url, only indexing content below that URL, and have the ability to limit searches upon those meta-tags. E.g. http://www.example1.com/something1/ - meta-tag companybranch1 http://www.example2.com/something2/ - meta-tag companybranch2 http://www.example3.com/something3/ - meta-tag companybranch1 http://www.example4.com/something4/ - meta-tag companybranch3 search for everything in companybranch1 or across 1 and 3 or similar -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-271) Meta-data per URL/site/section
[ http://issues.apache.org/jira/browse/NUTCH-271?page=comments#action_12412436 ] Gal Nitzan commented on NUTCH-271: -- Sorry for the short comment. Actually the meta tags functionality is already available in the 0.8 version along with a CrawlDatum object. You can build the required functionality just by developing plugins for parsing indexing and querying HTH. Meta-data per URL/site/section -- Key: NUTCH-271 URL: http://issues.apache.org/jira/browse/NUTCH-271 Project: Nutch Type: New Feature Versions: 0.7.2 Reporter: Stefan Neufeind We have the need to index sites and attach additional meta-data-tags to them. Afaik this is not yet possible, or is there a workaround I don't see? What I think of is using meta-tags per start-url, only indexing content below that URL, and have the ability to limit searches upon those meta-tags. E.g. http://www.example1.com/something1/ - meta-tag companybranch1 http://www.example2.com/something2/ - meta-tag companybranch2 http://www.example3.com/something3/ - meta-tag companybranch1 http://www.example4.com/something4/ - meta-tag companybranch3 search for everything in companybranch1 or across 1 and 3 or similar -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
RE: Proposal for Avoiding Content Generation Sites
Actually there is a property in conf: generate.max.per.host. So if you add a message in Generator.java at the appropriate place... you have what you wish...

Gal

-Original Message-
From: Rod Taylor [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 08, 2006 7:28 PM
To: Nutch Developer List
Subject: Proposal for Avoiding Content Generation Sites

We've indexed several content generation sites that we want to eliminate. One had hundreds of thousands of sub-domains spread across several domains (up to 50M pages in total). Quite annoying.

First is to allow for cleaning up. This consists of a new option to updatedb which can scrub the database of all URLs which no longer match URLFilter settings (regex-urlfilter.txt). This allows a change in the urlfilter to be reflected against Nutch's current dataset, something I think others have asked for in the past.

Second is to treat a subdomain as being in the same bucket as the domain for the generator. This means that *.domain.com or *.domain.co.uk would create 2 buckets instead of one per hostname. There is a high likelihood that sub-domains will be on the same servers as the primary domain and should be rate-limited as such. generate.max.per.host would work more as generate.max.per.domain instead.

Third is ongoing detection. I would like to add a feature to Nutch which could report anomalies during updatedb or generate. For example, if any given domain.com bucket during generate is found to have more than 5000 URLs to be downloaded, it should be flagged for manual review: write a record to a text file which can be read and picked up by a person to confirm that we haven't gotten into a garbage content generation site. If we are in a content generation site, the person would add this domain to the urlfilter and the next updatedb would clean out all URLs from that location.

Are there any thoughts or objections to this? One and two are pretty straightforward. Detection is not so easy.

--
Rod Taylor [EMAIL PROTECTED]
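The second point (bucketing *.domain.com with domain.com) amounts to collapsing a hostname to its registrable domain. A naive sketch of that mapping is below; a real implementation would consult the full public-suffix list, whereas this cut-down version hard-codes a few two-part suffixes purely for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Naive "bucket subdomains with their parent domain" helper: keep the last
// two labels of a hostname, or three when the last two form a known two-part
// suffix such as co.uk. Illustrative only; not the Generator's actual logic.
public class DomainBucket {
    private static final Set<String> TWO_PART_SUFFIXES =
        new HashSet<>(Arrays.asList("co.uk", "com.au", "co.jp"));

    public static String bucket(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        int n = labels.length;
        if (n <= 2) return host.toLowerCase();
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        int keep = TWO_PART_SUFFIXES.contains(lastTwo) ? 3 : 2;
        if (n <= keep) return host.toLowerCase();
        StringBuilder sb = new StringBuilder();
        for (int i = n - keep; i < n; i++) {
            if (sb.length() > 0) sb.append('.');
            sb.append(labels[i]);
        }
        return sb.toString();
    }
}
```

With this, foo.domain.com and bar.domain.com land in the same generate bucket, so generate.max.per.host would effectively act per domain.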
Re: Unable to complete a full fetch, reason Child Error
Still got the same... I'm not sure if it is relevant to this issue, but the call you added to Fetcher.java:

job.setBoolean("mapred.speculative.execution", false);

doesn't work. All task trackers still fetch together even though I have only 3 sites in the fetchlist. The task trackers fetch the same pages... I have used the latest build from hadoop trunk.

Gal.

On Fri, 2006-02-24 at 14:15 -0800, Doug Cutting wrote:
Mike Smith wrote:
> 060219 142408 task_m_grycae Parent died. Exiting task_m_grycae

This means the child process, executing the task, was unable to ping its parent process (the task tracker).

> 060219 142408 task_m_grycae Child Error
> java.io.IOException: Task process exit with nonzero status.
>         at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)

And this means that the parent was really still alive, and has noticed that the child killed itself. It would be good to know how the child failed to contact its parent. We should probably log a stack trace when this happens. I just made that change in Hadoop and will propagate it to Nutch.

Doug
RE: Nutch Improvement - HTML Parser
You can always implement your own parser.

On Sat, 2006-02-25 at 16:51 -0500, Fuad Efendi wrote:
> Let's do this: create (or use existing) low-level processing. I mean, use StartTag and EndTag (which could be different in case of malformed HTML), and look at what is inside. In this case performance will improve, and functionality too, because we are not building a DOM and we are not trying to find and fix HTML errors. Of course our Tag class will have Attributes, and we will have StartTag, EndTag, etc. I call it low-level 'parsing'. Are we using a DOM to parse RTF, PDF, XLS, TXT? Even inside the existing parser we are using Perl5 to check some metadata right before parsing.

= Yes sure. I think everybody has already done such things at school... Building a DOM provides:
1. better parsing of malformed html documents (and there are a lot of malformed docs on the web)
2. the ability to extract meta-information such as the creative commons license
3. a high degree of extensibility (HtmlParser extension point) to extract some specific information without parsing the document many times (for instance extracting technorati-like tags, ...) just by providing a simple plugin.
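The "low-level StartTag/EndTag" approach described above can be sketched in a few lines: walk the raw HTML once and emit tag events without building a DOM, simply tolerating malformed markup. This is an illustration of the idea being debated, not the parser Nutch actually uses:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal single-pass tag scanner: emits "start:<name>" / "end:<name>"
// events in document order. Comments/doctypes are skipped crudely and an
// unclosed trailing tag is ignored rather than treated as an error.
public class TagScanner {
    public static List<String> scan(String html) {
        List<String> events = new ArrayList<>();
        int i = 0;
        while ((i = html.indexOf('<', i)) >= 0) {
            int close = html.indexOf('>', i);
            if (close < 0) break;                  // truncated tag: ignore rest
            String body = html.substring(i + 1, close).trim();
            i = close + 1;
            if (body.isEmpty() || body.startsWith("!")) continue; // comment/doctype
            boolean end = body.startsWith("/");
            if (end) body = body.substring(1);
            String name = body.split("[\\s/]")[0].toLowerCase();
            if (!name.isEmpty()) events.add((end ? "end:" : "start:") + name);
        }
        return events;
    }
}
```

The trade-off discussed in the thread is visible here: this is fast and never fails on bad HTML, but it cannot repair nesting errors or expose a tree the way a DOM-building parser can.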
Re: All tasktrackers access same site at the same time (hadoop) please help
Thanks for the prompt reply. I have updated Fetcher.java and hadoop.jar from trunk but I still get the aforementioned behavior. On Wed, 2006-02-15 at 15:02 -0800, Doug Cutting wrote: Gal Nitzan wrote: I noticed all tasktrackers are participating in the fetch. I have only one site in the injected seed file I have 5 tasktrackers all except one access the same site. I just fixed a bug related to this. Please try updating. The problem was that MapReduce recently started supporting speculative execution, where, if some tasks appear to be executing slowly and there are idle nodes, then these tasks automatically are run in parallel on another node, and the results of the first that finishes are used. But this is not appropriate for fetching. So I just added a mechanism to Hadoop to disable it and then disabled it in the Fetcher. Note also that the slaves file is now located in the conf/ directory, as is a new file named hadoop-env.sh. This contains all relevant environment variables, so that we no longer have to rely on ssh's environment passing feature. Doug
Unable to complete a full fetch, reason Child Error
During fetch, all tasktrackers abort the fetch with:

task_m_b45ma2 Child Error
java.io.IOException: Task process exit with nonzero status.
        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144)
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)
Global locking
I have implemented a down-and-dirty global locking scheme. I am currently testing it, but I would like to get other people's ideas on this. I used RMI for this purpose: an RMI server which implements two methods:

boolean lock(String urlString);
void unlock(String urlString);

The server holds a map<key, val> where key is an Integer (host hash) and val is a very simplistic class:

public class LockObj {
  private int hash;
  private long start;
  private long timeout;
  private int max_locks;
  private int locks = 0;
  private Object sync_obj = new Object();

  public LockObj(int hash, long timeout, int max_locks) {
    this.hash = hash;
    this.timeout = timeout;
    start = new Date().getTime();
    this.max_locks = max_locks;
  }

  public synchronized boolean lock() {
    boolean ret = false;
    if (locks + 1 <= max_locks) {
      synchronized (sync_obj) {
        locks++;
      }
      ret = true;
    }
    return ret;
  }

  public synchronized void unlock() {
    if (locks > 0) {
      synchronized (sync_obj) {
        locks--;
      }
    }
  }

  public int locks() {
    return locks;
  }

  // convert the host part of a url to a hash;
  // on url exception, use the string input for the hash
  public static int make_hash(String urlString) {
    URL url = null;
    try {
      url = new URL(urlString);
    } catch (MalformedURLException e) {
    }
    return (url == null ? urlString : url.getHost()).hashCode();
  }

  // check if this object's timeout has been reached
  // (later, implement a listener event)
  public boolean timeout_reached() {
    long current = new Date().getTime();
    return (current - start) > timeout;
  }

  // free all
  public void unlock_all() {
    synchronized (sync_obj) {
      while (locks != 0) locks--;
    }
  }

  public int hash() {
    return hash;
  }
}

Not the prettiest thing, but I just got past the first barrier... it worked!!! I changed the FetcherThread constructor to create an instance of SyncManager, and in the run method I try to get a lock on the host. If not successful I add the url into a list of <key, datum> pairs for later processing... I also changed the generator to put each url into a separate array so all fetchlists are even.
Would appreciate your comments and any way to improve. The RMI is a little cumbersome but hey... for now it works for 5 task trackers without a problem (so it seems) :)

Gal

On Wed, 2006-02-15 at 14:55 -0800, Doug Cutting wrote:
Andrzej Bialecki wrote:
> (FYI: if you wonder how it was working before, the trick was to generate just 1 split for the fetch job, which then led to just one task being created for any input fetchlist.)

I don't think that's right. The generator uses setNumReduceTasks() to set the desired number of fetch tasks, to control how many host-disjoint fetchlists are generated. Then the fetcher does not permit input files to be split, so that fetch tasks remain host-disjoint. So lots of splits can be generated, by default one per mapred.map.tasks, permitting lots of parallel fetching. This should still work. If it does not, I'd be interested to hear more details.

Doug
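For comparison with the RMI approach above, the same per-host limit can be expressed in-process with standard concurrency primitives; this sketch plays the role of the LockObj map inside a single JVM only (the whole point of the RMI server is to share the state across task trackers), and the class and method names are illustrative:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// In-JVM analogue of the global lock: a Semaphore per host, capped at
// maxLocksPerHost concurrent fetches. Purely illustrative.
public class HostLocks {
    private final int maxLocksPerHost;
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    public HostLocks(int maxLocksPerHost) { this.maxLocksPerHost = maxLocksPerHost; }

    private Semaphore forHost(String host) {
        return locks.computeIfAbsent(host, h -> new Semaphore(maxLocksPerHost));
    }

    /** Try to take a fetch slot for this host; non-blocking, like lock(). */
    public boolean lock(String host) { return forHost(host).tryAcquire(); }

    /** Release a previously taken slot, like unlock(). */
    public void unlock(String host) { forHost(host).release(); }
}
```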
Re: Global locking
Well, at the moment it solves the problem I mentioned yesterday, where all tasktrackers access the same site with hadoop. It seems that the use of job.setBoolean("mapred.speculative.execution", false); didn't help, and I'm not sure why. However, though it is one more piece of software, it removes the need for special treatment for the fetcher, i.e. special fetch lists built by the generator. So now the fetcher/tasktracker accesses hosts politely, but its list still contains various hosts. Sometimes I noticed that the generator created a fetchlist where both hosts (only 2 hosts in the seed) were put in the same fetchlist, which made only one tasktracker work instead of two. I'm sorry if it sounds a little confusing :) or unreasonable... :)

Gal

On Thu, 2006-02-16 at 13:47 -0800, Doug Cutting wrote:
Gal Nitzan wrote:
> I have implemented a down-and-dirty global locking scheme: [ ... ] I changed the FetcherThread constructor to create an instance of SyncManager, and in the run method I try to get a lock on the host. If not successful I add the url into a list of <key, datum> pairs for later processing... I also changed the generator to put each url into a separate array so all fetchlists are even.

What problem does this fix?

Doug
All tasktrackers access same site at the same time (hadoop) please help
Hi,

Just installed 0.8 with hadoop from trunk. I noticed all tasktrackers are participating in the fetch. I have only one site in the injected seed file; I have 5 tasktrackers and all except one access the same site. I am using nutch 0.8-dev with hadoop.

Please, any idea? Thanks.
[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml
[ http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12364010 ] Gal Nitzan commented on NUTCH-186: -- After reading the code I think I figured it out... :) The issue of mapred-default.xml is totally misleading. Actually, the mapred.map.tasks and mapred.reduce.tasks properties do not have any effect when placed in mapred-default.xml (unless JobConf needs them, which I didn't check) because this file is loaded only when JobConf is constructed. The tasktracker looks for these properties in nutch-site and not in mapred-default. If these properties do not exist in nutch-site.xml with the correct values for your system, the values will be picked up from nutch-default.xml. Further, I am not sure that nutch-site.xml overriding everything should be the correct behavior. Most users know that nutch-site.xml overrides nutch-default, but I think we should leave them the option to override nutch-site; it would be a good start towards breaking the configuration into parts (ndfs and mapred are going to be separated from nutch)... Gal

mapred-default.xml is over ridden by nutch-site.xml --- Key: NUTCH-186 URL: http://issues.apache.org/jira/browse/NUTCH-186 Project: Nutch Type: Bug Versions: 0.8-dev Environment: All Reporter: Gal Nitzan Priority: Minor Attachments: myBeautifulPatch.patch, myBeautifulPatch.patch

If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and also in mapred-default.xml, the definitions from nutch-site.xml are those that will take effect. So if a user mistakenly copies those entries into nutch-site.xml from nutch-default.xml, she will not understand what happens. I would like to propose removing these settings completely from nutch-default.xml and putting them only in mapred-default.xml where they belong. I will be happy to supply a patch for that if the proposition is accepted.

-- This message is automatically generated by JIRA.
[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml
[ http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12363903 ] Gal Nitzan commented on NUTCH-186: -- ok, JobConf extends NutchConf, and in the JobConf constructor it adds the mapred-default.xml resource. The call to add a resource in NutchConf actually inserts any resource file before nutch-site.xml, so there is no way to override it. Look at the code at the bottom. The only thing required is to change line 85 in NutchConf to be:

resourceNames.add(name); // add resource name

instead of

resourceNames.add(resourceNames.size()-1, name); // add second to last

and add one more line to the JobConf constructor:

addConfResource("mapred-site.xml");

This way nutch-site.xml overrides nutch-default.xml, but other added resources can override nutch-site.xml, which in my opinion is reasonable. If acceptable, I will create the patch.

--- current code in NutchConf.java ---

public synchronized void addConfResource(File file) {
  addConfResourceInternal(file);
}

private synchronized void addConfResourceInternal(Object name) {
  resourceNames.add(resourceNames.size()-1, name); // add second to last
  properties = null; // trigger reload
}

mapred-default.xml is over ridden by nutch-site.xml --- Key: NUTCH-186 URL: http://issues.apache.org/jira/browse/NUTCH-186 Project: Nutch Type: Bug Versions: 0.8-dev Environment: All Reporter: Gal Nitzan Priority: Minor

If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and also in mapred-default.xml, the definitions from nutch-site.xml are those that will take effect. So if a user mistakenly copies those entries into nutch-site.xml from nutch-default.xml, she will not understand what happens. I would like to propose removing these settings completely from nutch-default.xml and putting them only in mapred-default.xml where they belong. I will be happy to supply a patch for that if the proposition is accepted.

-- This message is automatically generated by JIRA.
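The ordering argument in the comment above boils down to which resource is loaded last. A tiny stand-alone demo of that precedence, with plain java.util.Properties standing in for NutchConf (names here are illustrative):

```java
import java.util.Properties;

// Overlay configuration layers in order; later layers override earlier ones,
// which is exactly why "site file loaded last" vs. "resources appended after
// the site file" changes which value wins.
public class ConfOrder {
    public static Properties overlay(Properties... layers) {
        Properties merged = new Properties();
        for (Properties p : layers) merged.putAll(p);
        return merged;
    }
}
```

With the current behavior the site file is effectively last and always wins; with the proposed change, a resource added after it (e.g. a mapred-site.xml) would win instead.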
[jira] Updated: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml
[ http://issues.apache.org/jira/browse/NUTCH-186?page=all ] Gal Nitzan updated NUTCH-186: - Attachment: myBeautifulPatch.patch the patch attached mapred-default.xml is over ridden by nutch-site.xml --- Key: NUTCH-186 URL: http://issues.apache.org/jira/browse/NUTCH-186 Project: Nutch Type: Bug Versions: 0.8-dev Environment: All Reporter: Gal Nitzan Priority: Minor Attachments: myBeautifulPatch.patch If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and also in mapred-default.xml the definitions from nutch-site.xml are those that will take effect. So if a user mistakenly copies those entries into nutch-site.xml from the nutch-default.xml she will not understand what happens. I would like to propose removing these setting completely from the nutch-default.xml and put it only in mapred-default.xml where it belongs. I will be happy to supply a patch for that if the proposition accepted. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type
Proposition: Enable Nutch to use a parser plugin not just based on content type --- Key: NUTCH-179 URL: http://issues.apache.org/jira/browse/NUTCH-179 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Gal Nitzan Sometimes there are requirements from the real world (usually your boss) where a special parse is required for a certain site. Sample: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself, I noticed that it would enable a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and, if matched, use the correct plugin based on the url and not just the content type. The implementation shouldn't be too complicated. Looking to hear more ideas. -- This message is automatically generated by JIRA.
[jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type
[ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Gal Nitzan updated NUTCH-179: - Description: Sometimes there are requirements from the real world (usually your boss) where a special parse is required for a certain site. Though the content type is text/html, a specialized parser is needed. Sample: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself, I noticed that it would enable a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and, if matched, use the correct plugin based on the url and not just the content type. The implementation shouldn't be too complicated. Looking to hear more ideas. was: Sometimes there are requirements from the real world (usually your boss) where a special parse is required for a certain site. Sample: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself, I noticed that it would enable a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and, if matched, use the correct plugin based on the url and not just the content type. The implementation shouldn't be too complicated. Looking to hear more ideas. Proposition: Enable Nutch to use a parser plugin not just based on content type --- Key: NUTCH-179 URL: http://issues.apache.org/jira/browse/NUTCH-179 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Gal Nitzan Sometimes there are requirements from the real world (usually your boss) where a special parse is required for a certain site. Though the content type is text/html, a specialized parser is needed. Sample: I am required to crawl certain sites, some of which are partner sites. When fetching from a partner site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself, I noticed that it would enable a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and, if matched, use the correct plugin based on the url and not just the content type. The implementation shouldn't be too complicated. Looking to hear more ideas. -- This message is automatically generated by JIRA.
Re: HTMLMetaProcessor a bug?
Thanks, I was checking something with the default from the jdk... On Tue, 2006-01-10 at 11:06 +0100, Jérôme Charron wrote: the following code would fail in case the meta tags are in upper case Node nameNode = attrs.getNamedItem("name"); Node equivNode = attrs.getNamedItem("http-equiv"); Node contentNode = attrs.getNamedItem("content"); This code works well, because the Nutch HTML Parser uses Xerces' HTMLDocumentImpl object, which lowercases attributes (as opposed to element names, which are uppercased). For consistency, and to decouple the Nutch HTML Parser a little from the Xerces implementation, I suggest changing these lines to something like: Node nameNode = null; Node equivNode = null; Node contentNode = null; for (int i = 0; i < attrs.getLength(); i++) { Node attr = attrs.item(i); String attrName = attr.getNodeName().toLowerCase(); if (attrName.equals("name")) { nameNode = attr; } else if (attrName.equals("http-equiv")) { equivNode = attr; } else if (attrName.equals("content")) { contentNode = attr; } } Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
fetch of XXX failed with: java.lang.ClassCastException: java.util.ArrayList
Hi, I traced it to ParseData line 147. UTF8.writeString(out, (String) e.getKey()); UTF8.writeString(out, (String) e.getValue()); It seems that the Set-Cookie key comes with an ArrayList value?
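The failure mode the stack trace suggests can be sketched like this (illustrative names only, not Nutch's actual ParseData internals): a metadata map that stores a repeated header such as Set-Cookie as a list makes a blind (String) cast throw ClassCastException, while flattening the value avoids it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class HeaderCastDemo {
    // Defensive alternative to the blind (String) cast: flatten a
    // multi-valued header into one string instead of throwing.
    public static String safeValue(Object v) {
        if (v instanceof List) {
            return ((List<?>) v).stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(","));
        }
        return String.valueOf(v);
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("Content-Type", "text/html");
        meta.put("Set-Cookie", List.of("a=1", "b=2")); // repeated header stored as a list
        for (Map.Entry<String, Object> e : meta.entrySet()) {
            // ((String) e.getValue()) would throw ClassCastException on Set-Cookie
            System.out.println(e.getKey() + ": " + safeValue(e.getValue()));
        }
    }
}
```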
Re: HTMLMetaProcessor a bug?
Because I needed to add two more fields from the meta tags in the html page, I have revised some of the code in HTMLMetaProcessor and in DOMContentUtils. I believe it to be a little more generic than the existing code (look at DOMContentUtils.getMetaAttributes) and than the sample here from Jérôme, since the existing code can handle only http-equiv or name... Since I am not too familiar with svn, I paste it below this email; it might be useful to someone. On Tue, 2006-01-10 at 08:48 -0800, Doug Cutting wrote: Jérôme Charron wrote: For consistency and to decouple a little Nutch HTML Parser and Xerces implementation, I suggest to change these lines by something like: Node nameNode = null; Node equivNode = null; Node contentNode = null; for (int i = 0; i < attrs.getLength(); i++) { Node attr = attrs.item(i); String attrName = attr.getNodeName().toLowerCase(); if (attrName.equals("name")) { nameNode = attr; } else if (attrName.equals("http-equiv")) { equivNode = attr; } else if (attrName.equals("content")) { contentNode = attr; } } +1

/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.parse.html;

import java.net.URL;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.HashMap;

import org.apache.nutch.parse.Outlink;
import org.w3c.dom.*;

/**
 * A collection of methods for extracting content from DOM trees.
 * <p/>
 * This class holds a few utility methods for pulling content out of
 * DOM nodes, such as getOutlinks, getText, etc.
 */
public class DOMContentUtils {

  public static class LinkParams {
    public String elName;
    public String attrName;
    public int childLen;

    public LinkParams(String elName, String attrName, int childLen) {
      this.elName = elName;
      this.attrName = attrName;
      this.childLen = childLen;
    }

    public String toString() {
      return "LP[el=" + elName + ",attr=" + attrName + ",len=" + childLen + "]";
    }
  }

  public static HashMap linkParams = new HashMap();

  static {
    linkParams.put("a", new LinkParams("a", "href", 1));
    linkParams.put("area", new LinkParams("area", "href", 0));
    linkParams.put("form", new LinkParams("form", "action", 1));
    linkParams.put("frame", new LinkParams("frame", "src", 0));
    linkParams.put("iframe", new LinkParams("iframe", "src", 0));
    linkParams.put("script", new LinkParams("script", "src", 0));
    linkParams.put("link", new LinkParams("link", "href", 0));
    linkParams.put("img", new LinkParams("img", "src", 0));
  }

  /**
   * This method takes a {@link StringBuffer} and a DOM {@link Node},
   * and will append all the content text found beneath the DOM node to
   * the <code>StringBuffer</code>.
   * <p/>
   * If <code>abortOnNestedAnchors</code> is true, DOM traversal will
   * be aborted and the <code>StringBuffer</code> will not contain
   * any text encountered after a nested anchor is found.
   *
   * @return true if nested anchors were found
   */
  public static final boolean getText(StringBuffer sb, Node node,
                                      boolean abortOnNestedAnchors) {
    if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
      return true;
    }
    return false;
  }

  /**
   * This is a convenience method, equivalent to {@link
   * #getText(StringBuffer,Node,boolean) getText(sb, node, false)}.
   */
  public static final void getText(StringBuffer sb, Node node) {
    getText(sb, node, false);
  }

  // returns true if abortOnNestedAnchors is true and we find nested
  // anchors
  private static final boolean getTextHelper(StringBuffer sb, Node node,
                                             boolean abortOnNestedAnchors,
                                             int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if ("style".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1) return true;
    }
    if (node.getNodeType() == Node.COMMENT_NODE) {
      return false;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = node.getNodeValue();
      text = text.replaceAll("\\s+", " ");
      text = text.trim();
      if (text.length() > 0) {
        if (sb.length() > 0)
What/how num of required maps is set?
I am trying to figure out how the required number of maps is set/calculated by Nutch. I have 3 task trackers. I added one more. When I run fetch, only the initial three are fetching. I added the task tracker before calling generate (if that has any meaning). Thanks, G.
Re: What/how num of required maps is set? Oops, wrong list
On Mon, 2006-01-09 at 12:07 +0200, Gal Nitzan wrote: I am trying to figure out how the required number of maps is set/calculated by Nutch. I have 3 task trackers. I added one more. When I run fetch, only the initial three are fetching. I added the task tracker before calling generate (if that has any meaning). Thanks, G.
HTMLMetaProcessor a bug?
Hi, I was going over the code and I noticed the following in class org.apache.nutch.parse.html.HTMLMetaProcessor, method getMetaTagsHelper. The following code would fail in case the meta tags are in upper case: Node nameNode = attrs.getNamedItem("name"); Node equivNode = attrs.getNamedItem("http-equiv"); Node contentNode = attrs.getNamedItem("content"); G.
Re: NPE in Indexer.java line 184
Hi Andrzej, The "value cannot be null" is my message :) 060109 094543 task_r_9xvvcz Could not get property: segment name 060109 094543 task_r_9xvvcz [Ljava.lang.StackTraceElement;@154864a 060109 094543 task_r_9xvvcz java.lang.NullPointerException: value cannot be null 060109 094543 task_r_9xvvcz at org.apache.lucene.document.Field.<init>(Field.java:469) 060109 094543 task_r_9xvvcz at org.apache.lucene.document.Field.<init>(Field.java:412) 060109 094543 task_r_9xvvcz at org.apache.lucene.document.Field.UnIndexed(Field.java:195) 060109 094543 task_r_9xvvcz at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:200) 060109 094543 task_r_9xvvcz at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:260) 060109 094543 task_r_9xvvcz at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603) Gal On Sun, 2006-01-08 at 10:07 +0100, Andrzej Bialecki wrote: Gal Nitzan wrote: Hi, While the reduce task is running I sometimes get this exception and it breaks the whole job. As a workaround I put this line in a try/catch and just return; however, I was not sure why the meta cannot find the segment key name. This workaround is good for now. Stacktrace?
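The workaround described in this thread can be sketched as a null guard (illustrative names only, not the actual Indexer code): look up the segment name in the document metadata and skip the document when it is missing, instead of letting Lucene's Field constructor throw "value cannot be null".

```java
import java.util.HashMap;
import java.util.Map;

public class SegmentFieldGuard {
    /** Returns the segment name, or null if the property is absent (skip the doc). */
    public static String segmentNameOf(Map<String, String> meta) {
        String segment = meta.get("segment");
        if (segment == null) {
            // Equivalent to the try/catch-and-return workaround: log and skip
            // rather than passing null into Field.<init>.
            System.err.println("Could not get property: segment name");
        }
        return segment;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        System.out.println(segmentNameOf(meta));     // missing -> null, doc skipped
        meta.put("segment", "20060109094543");
        System.out.println(segmentNameOf(meta));
    }
}
```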
NPE in Indexer.java line 184
Hi, While the reduce task is running I sometimes get this exception and it breaks the whole job. As a workaround I put this line in a try/catch and just return; however, I was not sure why the meta cannot find the segment key name. This workaround is good for now. G.
Re: mapred crawling exception - Job failed!
Yes, it was fixed; just update your code from trunk. On Wed, 2006-01-04 at 08:51 +0100, Andrzej Bialecki wrote: Lukas Vlcek wrote: Hi, I am trying to use the latest nutch-trunk version but I am facing an unexpected "Job failed!" exception. It seems that all crawling work has already been done but some threads are hung, which results in an exception after some timeout. This was fixed (or should be fixed :) in revision r365576. Please report if it doesn't fix it for you.
NegativeArraySizeException in search server
When trying to use the search server I get the following. I use the trunk from today... 060104 025549 13 Server handler 0 on 9004 call error: java.io.IOException: java.lang.NegativeArraySizeException java.io.IOException: java.lang.NegativeArraySizeException at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:35) at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:23) at org.apache.lucene.search.TopDocCollector.<init>(TopDocCollector.java:47) at org.apache.nutch.searcher.LuceneQueryOptimizer$LimitedCollector.<init>(LuceneQueryOptimizer.java:52) at org.apache.nutch.searcher.LuceneQueryOptimizer.optimize(LuceneQueryOptimizer.java:153) at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:93) at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:155) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:324) at org.apache.nutch.ipc.RPC$1.call(RPC.java:186) at org.apache.nutch.ipc.Server$Handler.run(Server.java:200)
Trunk is broken
It seems that trunk is now broken... In Crawl.java line 111 the parameter for parsing is missing. For myself I have added the line: boolean parsing = conf.getBoolean("fetcher.parse", true); and added the param parsing to: new Fetcher(conf).fetch(segment, threads, parsing); // fetch it Also the Javadoc build has a million errors. Gal
Bug in DeleteDuplicates.java ?
This function throws IOException. Why? public long getPos() throws IOException { return (doc*INDEX_LENGTH)/maxDoc; } If anything, it would throw ArithmeticException: what happens when maxDoc is zero? Gal
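The question raised here is a real division-by-zero hazard: with an empty index (maxDoc == 0) the division throws ArithmeticException ("/ by zero"), which is exactly the failure shown in the "java.io.IOException: Job failed" report in this thread. A guarded sketch (the INDEX_LENGTH constant is illustrative):

```java
public class PosDemo {
    static final long INDEX_LENGTH = 1000; // illustrative constant, not Nutch's actual value

    static long getPos(int doc, int maxDoc) {
        if (maxDoc == 0) {
            return 0; // empty index: report position 0 instead of dividing by zero
        }
        return (doc * INDEX_LENGTH) / maxDoc;
    }

    public static void main(String[] args) {
        System.out.println(getPos(5, 10)); // prints 500
        System.out.println(getPos(0, 0));  // prints 0, no ArithmeticException
    }
}
```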
java.io.IOException: Job failed
Hi, I am using trunk. While trying to crawl I get the following: in crawl log: 051229 235114 Dedup: adding indexes in: crawl/indexes 051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-default.xml 051229 235114 parsing file:/home/nutchuser/nutch/conf/crawl-tool.xml 051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml 051229 235114 parsing file:/home/nutchuser/nutch/conf/mapred-default.xml 051229 235114 parsing file:/home/nutchuser/nutch/conf/nutch-site.xml 051229 235115 Running job: job_r1bmnj 051229 235116 map 0% 051229 235138 reduce 100% Exception in thread "main" java.io.IOException: Job failed! at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:309) at org.apache.nutch.crawl.Crawl.main(Crawl.java:123) in tasktracker log: 050825 100222 task_m_ns3ehv Error running child 050825 100222 task_m_ns3ehv java.lang.ArithmeticException: / by zero 050825 100222 task_m_ns3ehv at org.apache.nutch.indexer.DeleteDuplicates$1.getPos(DeleteDuplicates.java:193) 050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:102) 050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48) 050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.MapTask.run(MapTask.java:116) 050825 100222 task_m_ns3ehv at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604) Regards, Gal
Re: searching return 0 hit
Hi Michael, At least on my side, every time I run index I must stop the server and then tomcat, and then restart first the server, then tomcat. I have asked about this twice on this list but nobody answered. I'm not sure it is the same issue, but try it. Regards, Gal. Michael Ji wrote: Somehow, I found my search engine didn't show the result, even though I can see the index from LukeAll. (It worked fine before.) I replaced the ROOT.WAR file in tomcat with nutch's and launched tomcat in nutch's segment directory (parallel to the index subdir). Should I reinstall Tomcat? Or will that be nutch's indexing issue? My system is running on Linux. thanks, Michael Ji, - 051019 215411 11 query: com 051019 215411 11 searching for 20 raw hits 051019 215411 11 total hits: 0 051019 215449 12 query request from 65.34.213.205 051019 215449 12 query: net 051019 215449 12 searching for 20 raw hits
[jira] Updated: (NUTCH-100) New plugin urlfilter-db
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ] Gal Nitzan updated NUTCH-100: - Attachment: urlfilter-db.tar.gz AddedDbURLFilter.patch Fixed some issues with swarm cache (removed loading as daemon). Code cleanup and remarks. Added some logging. New plugin urlfilter-db --- Key: NUTCH-100 URL: http://issues.apache.org/jira/browse/NUTCH-100 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Environment: MapRed Reporter: Gal Nitzan Priority: Trivial Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz Hi, I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database. For each url, filter is called: get the domain name from the url; call cache.get(domain); if not in cache, try the database; if in the database, cache it and return it; return null; end filter. The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml -- This message is automatically generated by JIRA.
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Hi Michael, At the moment I have about 3000 domains in my db. I didn't time the performance; however, having even 100k domains shouldn't have an impact, since each domain is fetched only once from the database into the cache. A small performance hit might appear above 100k (depending on the number of elements defined in the xml file). After a few birth problems, the plugin works nicely and I do not feel any impact. Regards, Gal Michael Ji wrote: hi, How is the performance concern if the size of the domain list reaches 10,000? Micheal Ji, --- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ] Gal Nitzan updated NUTCH-100: - type: Improvement (was: New Feature) Description: Hi, I have written a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database. was: Hi, I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database.
For each url, filter is called: get the domain name from the url; call cache.get(domain); if not in cache, try the database; if in the database, cache it and return it; return null; end filter. The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml Environment: All Nutch versions (was: MapRed) Fixed some issues, clean up. Added a patch for Subversion. New plugin urlfilter-db --- Key: NUTCH-100 URL: http://issues.apache.org/jira/browse/NUTCH-100 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Environment: All Nutch versions Reporter: Gal Nitzan Priority: Trivial Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz Hi, I have written a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database. For each url, filter is called: get the domain name from the url; call cache.get(domain); if not in cache, try the database; if in the database, cache it and return it; return null; end filter. The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml -- This message is automatically generated by JIRA.
[jira] Created: (NUTCH-100) New plugin urlfilter-db
New plugin urlfilter-db --- Key: NUTCH-100 URL: http://issues.apache.org/jira/browse/NUTCH-100 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Environment: MapRed Reporter: Gal Nitzan Priority: Trivial Hi, I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database. For each url, filter is called: get the domain name from the url; call cache.get(domain); if not in cache, try the database; if in the database, cache it and return it; return null; end filter. The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml -- This message is automatically generated by JIRA.
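The filter loop described in the issue can be sketched as follows. This is a hypothetical, in-memory stand-in: the real plugin uses SwarmCache and a JDBC-backed table, both stubbed here with Sets, and the class name is illustrative.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Sketch of the urlfilter-db loop: return the url if its domain is allowed,
// else null (the URLFilter contract for rejecting a url).
public class DbUrlFilterSketch {
    private final Set<String> cache = new HashSet<>(); // stands in for SwarmCache
    private final Set<String> database;                // stands in for the JDBC table

    public DbUrlFilterSketch(Set<String> allowedDomains) {
        this.database = allowedDomains;
    }

    public String filter(String url) {
        String domain;
        try {
            domain = new URL(url).getHost(); // get the domain name from the url
        } catch (MalformedURLException e) {
            return null;
        }
        if (cache.contains(domain)) return url; // cache hit
        if (database.contains(domain)) {        // not in cache: try the database
            cache.add(domain);                  // cache it
            return url;
        }
        return null;                            // unknown domain: filtered out
    }
}
```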
[jira] Updated: (NUTCH-100) New plugin urlfilter-db
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ] Gal Nitzan updated NUTCH-100: - Attachment: urlfilter-db.tar.gz The plugin. Extract, and in the myplugin folder read the README. New plugin urlfilter-db --- Key: NUTCH-100 URL: http://issues.apache.org/jira/browse/NUTCH-100 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Environment: MapRed Reporter: Gal Nitzan Priority: Trivial Attachments: urlfilter-db.tar.gz Hi, I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db. The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains. The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and, on the back-end, a database. For each url, filter is called: get the domain name from the url; call cache.get(domain); if not in cache, try the database; if in the database, cache it and return it; return null; end filter. The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml -- This message is automatically generated by JIRA.