Tomcat adds file:/// to searcher.dir path
Hello, I have installed nutch-1.2 on Fedora 14 with tomcat6. I added the path to the crawl dir in the searcher.dir property in WEB-INF/classes/nutch-default.xml as /home/user/nutch-1.2/crawl. In the catalina.out file I see: WARN SearchBean - Neither file:///home/user/nutch-1.2/crawl/index nor file:///home/home/nutch-1.2/crawl/indexes found! I think the problem is that Tomcat adds file:// to searcher.dir, because both folders are there and permissions are 777. Any ideas how to fix this issue? Thanks. A.
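For reference, the searcher.dir override is normally placed in nutch-site.xml (or the webapp's WEB-INF/classes/nutch-site.xml) rather than nutch-default.xml; a minimal sketch, reusing the path from the post (whether this alone removes the file:/// warning is an assumption):

  <property>
    <name>searcher.dir</name>
    <value>/home/user/nutch-1.2/crawl</value>
    <description>Path to the crawl directory containing index/indexes, linkdb and segments.</description>
  </property>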
failed with: java.net.UnknownHostException
Hello, I use nutch-1.2 with Fedora 14 and am trying to index about 4000 domains. I use bin/nutch crawl urls -dir crawl -depth 3 -topN -1 and have this in crawl-urlfilter.txt: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)* I noticed that if a domain is entered as http://mydomain.com in the seed file, nutch gives the error failed with: java.net.UnknownHostException for some domains. If, however, I enter the same domain with www, like http://www.mydomain.com, nutch does not give any errors. Since entering http://mydomain.com in the browser redirects to http://www.mydomain.com, I thought this might be a bug in nutch. Any thoughts on how to fix this issue? Thanks. Alex.
unnecessary results in search
Hello, I used nutch-1.2 to index a few domains. I noticed that nutch correctly crawled all sub-pages of the domains. By sub-pages I mean the following: for a domain mydomain.com, all links inside it, like mydomain.com/show/photos/1, etc. I also noticed in our apache logs that google-bot also crawled all sub-pages. However, in a search for mydomain.com Google gives mydomain.com on the first page and almost no subpages, but nutch gives all subpages. If a domain has, say, 200 sub-pages and we display 10 results per page, then it would take us 20 pages to go forward to see results from other domains. In contrast, Google displays results from other domains in second place. Is there a way of fixing this issue? Thanks in advance. Alex.
Re: unnecessary results in search
Hello, Thank you for your response. Let me give you more detail on the issue that I have. First, definitions. Say I have my own domain that I host on a dedicated server; call it mydomain.com. Next, call subdomains the following: answers.mydomain.com, mail.mydomain.com, maps.mydomain.com, etc. Call subpages the following: mydomain.com/show/photos/1, mydomain.com/forum/id/5, etc. Having these definitions, I have observed by examining apache log files that the Google and Nutch crawlers crawled all subpages of mydomain.com. However, if we search in Google for the keyword mydomain.com it gives in the results all subdomains of mydomain.com, not all subpages, maybe some of them. If we search in Nutch for the keyword mydomain.com it gives all subdomains and subpages. My concern was not to include all subpages in a search for the keyword mydomain.com. Of course, we must see subpages for keywords that are in those subpages. This means we must not remove subpages from the index. I hope this gives you more detail on the issue that I have. Thanks. Alex. -Original Message- From: Gora Mohanty g...@mimirtech.com To: user user@nutch.apache.org Sent: Tue, Jan 4, 2011 3:28 am Subject: Re: unnecessary results in search On Tue, Jan 4, 2011 at 5:40 AM, alx...@aim.com wrote: Hello, I used nutch-1.2 to index a few domains. I noticed that nutch correctly crawled all sub-pages of the domains. By sub-pages I mean the following: for a domain mydomain.com, all links inside it, like mydomain.com/show/photos/1, etc. I also noticed in our apache logs that google-bot also crawled all sub-pages. However, in a search for mydomain.com Google gives mydomain.com on the first page and almost no subpages, but nutch gives all subpages. If a domain has, say, 200 sub-pages and we display 10 results per page, then it would take us 20 pages to go forward to see results from other domains. In contrast, Google displays results from other domains in second place. [...] It is not entirely clear what you want: * If your goal is to only crawl to a certain depth on a domain, you can use the -depth argument for the Nutch crawl, or use the -topN option to specify the max. number of pages to retrieve. * Can you give an actual example of what you are searching for? It is difficult to understand your description above. E.g., searching Google for yahoo.com returns many, many links from yahoo.com. * If you mean that a search with any query string returns different results between Google and Nutch, that could be due to many reasons. In both cases, the returned pages are ranked by relevancy, but the algorithm is different. Also, Google has probably indexed many more sites than your Nutch crawl. Regards, Gora
Re: Exception on segment merging
Which command did you use? Merging segments is very expensive in resources, so I try to avoid merging them. -Original Message- From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com To: user user@nutch.apache.org Sent: Tue, Jan 4, 2011 7:12 am Subject: FW: Exception on segment merging I see in the hadoop log some more details about the exception. Please help me figure out what to check for this error. Here are the details:
2011-01-04 07:40:23,999 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:36,563 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:36,563 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:43,685 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:43,686 INFO segment.SegmentMerger - Slice size: 5 URLs.
2011-01-04 07:40:47,316 WARN mapred.LocalJobRunner - job_local_0001
java.io.IOException: Spill failed
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1044)
 at java.io.DataOutputStream.write(DataOutputStream.java:90)
 at org.apache.hadoop.io.Text.writeString(Text.java:412)
 at org.apache.nutch.metadata.Metadata.write(Metadata.java:220)
 at org.apache.nutch.protocol.Content.write(Content.java:170)
 at org.apache.hadoop.io.GenericWritable.write(GenericWritable.java:135)
 at org.apache.nutch.metadata.MetaWrapper.write(MetaWrapper.java:107)
 at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
 at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:900)
 at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
 at org.apache.nutch.segment.SegmentMerger.map(SegmentMerger.java:361)
 at org.apache.nutch.segment.SegmentMerger.map(SegmentMerger.java:113)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0001/attempt_local_0001_m_32_0/output/spill0.out
 at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
 at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
 at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1221)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
-Original Message- From: Marseld Dedgjonaj [mailto:marseld.dedgjo...@ikubinfo.com] Sent: Tuesday, January 04, 2011 1:28 PM To: user@nutch.apache.org Subject: Exception on segment merging Hello everybody, I have configured nutch-1.2 to crawl all urls of a specific website. It ran fine for a while, but now that the number of indexed urls has grown to more than 30,000, I get an exception on segment merging. Has anybody seen this kind of error? The exception is shown below. Slice size: 5 URLs. Slice size: 5 URLs. Slice size: 5 URLs. Slice size: 5 URLs. Slice size: 5 URLs. Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
 at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:638)
 at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:683)
Merge Segments- End at: 04-01-2011 07:40:48 Thanks in advance Best Regards, Marseldi
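For reference, the segment merge that produces the Slice size log lines above is normally invoked along these lines in Nutch 1.2; the output directory and slice size are placeholders, and whether this matches the command Marseldi ran is an assumption:

  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -slice 50000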
Re: unnecessary results in search
One more thing I just noticed is that Nutch search results do not display information from the meta tag. Google and Yahoo do. In more detail, Nutch search results for the keyword mydomain.com display some short text from the page mydomain.com. In contrast, Google and Yahoo search results for the same keyword display words from the meta tag. How can this be fixed in Nutch? Thanks. Alex. -Original Message- From: Gora Mohanty g...@mimirtech.com To: user user@nutch.apache.org Sent: Wed, Jan 5, 2011 10:20 am Subject: Re: unnecessary results in search On Wed, Jan 5, 2011 at 11:25 PM, alx...@aim.com wrote: I do search directly in Nutch version 1.2. I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword. That is possible, though I am not sure why the situation is different with non-popular domains. This must be so because if mydomain.com has, say, 2000 subpages then in the search result for the keyword mydomain.com the next 200 pages will all be subpages of mydomain.com. If someone could direct me to the part of the source code where Nutch gives scores to pages I can take a look at it. If you are using Nutch for search also, I am afraid that someone else will have to help you. I have no experience there. Regards, Gora
Re: unnecessary results in search
Hello, I just noticed that Google actually has results from all subpages of mydomain.com for the keyword mydomain.com, but they are hidden behind a link show more results from mydomain.com. Is there a way of putting more results from the same domain behind such a link in the Nutch rss feed, since I use opensearch to display results from nutch? Thanks. Alex. -Original Message- From: Gora Mohanty g...@mimirtech.com To: user user@nutch.apache.org Sent: Wed, Jan 5, 2011 10:20 am Subject: Re: unnecessary results in search On Wed, Jan 5, 2011 at 11:25 PM, alx...@aim.com wrote: I do search directly in Nutch version 1.2. I think Google gives very low scores to subpages of a domain and higher scores to other domains for a given keyword. That is possible, though I am not sure why the situation is different with non-popular domains. This must be so because if mydomain.com has, say, 2000 subpages then in the search result for the keyword mydomain.com the next 200 pages will all be subpages of mydomain.com. If someone could direct me to the part of the source code where Nutch gives scores to pages I can take a look at it. If you are using Nutch for search also, I am afraid that someone else will have to help you. I have no experience there. Regards, Gora
Re: Few questions from a newbie
You can set fetching of external and internal links to false and increase the depth. -Original Message- From: Churchill Nanje Mambe mambena...@afrovisiongroup.com To: user user@nutch.apache.org Sent: Wed, Jan 26, 2011 8:03 am Subject: Re: Few questions from a newbie Even if the url being crawled is shortened, it will still lead nutch to the actual link and nutch will fetch it.
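Presumably this refers to the link-following switches in nutch-site.xml; a sketch of the kind of setting being suggested (the property names are from Nutch 1.2's nutch-default.xml, and whether both the external and internal switches are wanted depends on the use case):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading to other hosts are ignored; db.ignore.internal.links is the analogous switch for same-host links.</description>
  </property>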
nutch crawl command takes 98% of cpu
Hello, I run the crawl command with -depth 7 -topN -1 on my linux box with a 1.5 Mbps internet connection, an AMD 3.1 GHz processor, 4GB memory, Fedora Linux 14, nutch 1.2. After 1-2 days nutch takes 98% of the cpu. My seed file includes about 3500 domains and I set fetching of external links to false. Is this normal? If not, what can be done to improve it? Thanks. Alex.
Re: Nutch search result
2nd, after testing to fetch several pages from wikipedia, the search query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache ../wiki_dir returns It returns a result for the keyword apache because that url has apache in it. -topN 50), it actually fetches some pages e.g. `fetching http://www.plurk.com/t/Brazil'). I am confused about the differences between using the crawl command and step-by-step crawling. crawling with crawl command (bin/nutch crawl urls -dir crawl -depth 3 In order to get the same fetching in the step-by-step approach you need to do the fetching 3 times, because you have depth 3 in the crawl command. -Original Message- From: Thomas Anderson t.dt.aander...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 18, 2011 9:10 pm Subject: Re: Nutch search result The version used is nutch 1.1. OS is debian testing. Java version is 1.6.0_23. The first question arises from testing a fetch of plurk.com. The url specified at the inject stage only contains e.g. http://plurk.com. After going through the steps described in the tutorial, I noticed no `fetching http:// ... ' keywords were displayed on the console. But when crawling with the crawl command (bin/nutch crawl urls -dir crawl -depth 3 -topN 50), it actually fetches some pages e.g. `fetching http://www.plurk.com/t/Brazil'). I am confused about the differences between using the crawl command and step-by-step crawling. When fetching wikipedia, the url specified is http://en.wikipedia.org. No ibm related url exists. But the file containing the wiki url resides under the wiki folder, which also stores crawldb, segments, etc. Thanks for help. On Fri, Feb 18, 2011 at 7:27 PM, McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk wrote: Hi Thomas Firstly which dist are you using? From: Thomas Anderson [t.dt.aander...@gmail.com] Sent: 18 February 2011 10:11 To: user@nutch.apache.org Subject: Nutch search result I followed the NutchTutorial and got the search working, but I have several questions. 1st, is it possible for a website to set up some restriction so that nutch can not fetch its pages, or so that the pages fetched are limited under some condition? If so, what file (e.g. robots.txt?) would nutch respect in order to avoid fetching specific pages? For this can you please specify your use scenario. If you have a website with certain areas which you wish not to be crawled, then I would assume a robots file would suffice. Inversely, if you wish to restrict Nutch from crawling certain pages of specific domains, then I imagine you would be looking at a different config of crawl-urlfilter. 2nd, after testing to fetch several pages from wikipedia, the search query with e.g. bin/nutch org.apache.nutch.searcher.NutchBean apache ../wiki_dir returns Total hits: 1 0 20110218171640/http://en.wikipedia.org/wiki/IBM IBM - Wikipedia, the free encyclopedia IBM From Wikipedia, the free encyclopedia Jump to: ... I'm afraid that I completely lose you here. Have you specified some IBM page within your /wiki_dir ? If so, it might be the case that Nutch has not fetched pages for a certain reason, e.g. politeness rules. Can anyone advise on this please? This seemingly does not relate to apache; is there any reason that may explain why it returns IBM? Or may any execution step below have gone wrong?
bin/nutch inject ../wiki/crawldb urls
bin/nutch generate ../wiki/crawldb ../wiki/segments
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 100
bin/nutch fetch `ls -d ../wiki/segments/2* | tail -1`
bin/nutch updatedb ../wiki/crawldb `ls -d ../wiki/segments/2* | tail -1`
bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb ../wiki/segments/*
In addition, why does only the third round of 'generate, fetch, and updatedb' actually fetch pages, while the second round only reports that it is done? The second round's messages: Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property. Fetcher: starting Fetcher: segment: ../wiki/segments/20110218171338 Fetcher: threads: 10 QueueFeeder finished: total 1 records + hit by time limit :0 -finishing thread FetcherThread, activeThreads=1 fetching http://en.wikipedia.org/wiki/Main_Page -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread, activeThreads=1 -finishing thread FetcherThread,
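For comparison, the step-by-step equivalent of bin/nutch crawl urls -dir crawl -depth 3 is usually three generate/fetch/updatedb rounds followed by invertlinks and index; a sketch using the poster's directory layout (the parse line is only needed if fetcher.parse is set to false):

  bin/nutch inject ../wiki/crawldb urls
  for round in 1 2 3; do
    bin/nutch generate ../wiki/crawldb ../wiki/segments -topN 50
    segment=`ls -d ../wiki/segments/2* | tail -1`
    bin/nutch fetch $segment
    # bin/nutch parse $segment
    bin/nutch updatedb ../wiki/crawldb $segment
  done
  bin/nutch invertlinks ../wiki/linkdb -dir ../wiki/segments
  bin/nutch index ../wiki/indexes ../wiki/crawldb ../wiki/linkdb ../wiki/segments/*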
Re: Starting web frontend
Hello, I wondered if there is a way of adding to a Solr index built from nutch segments another Solr index also built from nutch segments. I have to index about 3000 domains, but 5 of them are newspaper sites. So, I need to crawl-fetch-parse these 5 domains (with depth 2) and update the index every day or so. The rest is crawled and indexed once a month. Thanks. Alex. -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: Jeremy Arnold jer...@possiblyfaulty.com Cc: user user@nutch.apache.org Sent: Thu, Feb 24, 2011 3:46 pm Subject: Re: Starting web frontend Thanks for the reply Mark. So this means Nutch is really only going to be used for crawling now? Are there any plans for a JSON/XML RPC interface to using Nutch like Solr supports? Yes, Nutch is going to focus on the fetch and parse jobs. Andrzej was working on a REST interface to control these jobs. This is part of 2.0. I am interested in a tight app integration where I can easily start crawls of new sites, and add/remove things from the index quickly. I guess I can rely directly on Solr for adding/removing from the index as well, or would you recommend this going through nutch? Removing items from the index can be forced from Solr and Nutch. Solr provides easy methods to remove documents or documents that are the result of some query. Nutch can deduplicate (1.2+ and 2.0) and possibly remove 404 pages (1.3 and 2.0), but the latter is not committed. Thanks, Jeremy On Thu, Feb 24, 2011 at 12:23 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi Jeremy, Nutch's own search server is in the process of being deprecated; Nutch 1.2 was the last release to provide the search server. Please consider using Apache Solr as your search server. Cheers, I recently installed Nutch and have spent some time trying to get it working with limited success. ./nutch crawl urls -dir crawl -depth 5 -topN 50 After the crawl completes I am trying to run the web frontend with the following command: ./nutch server 8080 crawl The server seems to be running (no output on the command line), but when I hit localhost:8080 I get an Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error. Any ideas on how to get past this? I've been using this tutorial to get started. http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine Thanks, Jeremy
Re: Reload index without restart tomcat.
That tutorial is applicable to the new version too. -Original Message- From: Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com To: user user@nutch.apache.org; 'McGibbney, Lewis John' lewis.mcgibb...@gcu.ac.uk Sent: Tue, Mar 8, 2011 5:25 am Subject: RE: Reload index without restart tomcat. Hi Lewis, Thanks for your help. I tried to find a tutorial for integrating Nutch-1.2 with Solr, but everything I found is for older versions of nutch (nutch-1.0). Please, if you have any tutorial for nutch 1.2, send it to me. Regards, Marseldi -Original Message- From: McGibbney, Lewis John [mailto:lewis.mcgibb...@gcu.ac.uk] Sent: Monday, March 07, 2011 6:55 PM To: user@nutch.apache.org Subject: RE: Reload index without restart tomcat. Hi Marseld, You need to configure it, and it can be done in a number of ways (assuming you are using Nutch-1.2): 1) Individual commands: when attempting a whole web crawl, the solrindex option should be used to pass indexing to Solr 2) pass -solr http://blahblahblah as a parameter when using the crawl command Obviously there are a number of issues, such as a suitable Solr schema for field matching, but you should be able to find most info on this by combining posts from both the Nutch and Solr wikis respectively. Hope this helps Lewis From: Marseld Dedgjonaj [marseld.dedgjo...@ikubinfo.com] Sent: 07 March 2011 17:53 To: user@nutch.apache.org Subject: Reload index without restart tomcat. Hello Everybody, I am trying to reload the index without a restart of tomcat. I see in other topics that this is not possible in nutch without solr. I am using nutch-1.2. Is solr included as the indexer by default or should I configure it? Regards, Marseld
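A sketch of the two routes Lewis describes, for Nutch 1.2 (the Solr URL and crawl paths are placeholders):

  # one-shot crawl that hands indexing over to Solr
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50 -solr http://localhost:8983/solr/

  # or, as a separate step after fetching and parsing
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*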
will nutch-2 be able to index image files
Hello, I wondered if nutch version 2 will be able to index image files? Thanks. Alex.
Re: will nutch-2 be able to index image files
I meant to extract the image title, src link and alt from img tags, and not store the image files. For a keyword search it must display the link, which automatically displays the image itself in the search page. I am not sure what you mean by image content-based retrieval. Do image files have tags like mp3 ones? Must a parse plugin be written in both cases? Thanks. Alex. -Original Message- From: Andrzej Bialecki a...@getopt.org To: user user@nutch.apache.org Sent: Tue, Mar 8, 2011 12:58 pm Subject: Re: will nutch-2 be able to index image files On 3/8/11 9:09 PM, alx...@aim.com wrote: Hello, I wondered if nutch version 2 will be able to index image files? In what way? Extract metadata and index image metadata as text? Sure, if we implement a plugin for it. Tika already supports EXIF, so this shouldn't be complicated; perhaps it's a tweak to the parse-tika configuration. Or did you mean image content-based retrieval (e.g. using wavelets)? -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
Re: nutch crawl command takes 98% of cpu
Hello, Which version is this patch applicable to? Thanks. Alex. -Original Message- From: Alexis alexis.detregl...@gmail.com To: user user@nutch.apache.org Sent: Tue, Feb 8, 2011 9:59 am Subject: Re: nutch crawl command takes 98% of cpu Hi, Thanks for all the feedback. It looks like there is not much you can do if you give the FLV parser some corrupted data. From a practical point of view, we can say that this is extremely annoying, as it takes up all the CPU resources and prevents other threads from performing their tasks properly, till the TIMEOUT occurs, kills the thread and frees up the CPU. We can notice that this happens when an FLV file is truncated (due to an http.content.limit property lower than its content-length, in bytes). So the suggestion is to hint to the parser that it is likely to get stuck and skip the parsing in case the downloaded content size mismatches the content-length header. Besides, I often see errors in the HTML parser when the content is truncated (https://issues.apache.org/jira/browse/TIKA-307). So it does not hurt saving time and avoiding errors. I created the issue here: https://issues.apache.org/jira/browse/NUTCH-965 See attached patch. Alexis. On Mon, Feb 7, 2011 at 12:00 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Kirby and others, On Jan 31, 2011, at 4:39pm, Kirby Bohling wrote: On Sat, Jan 29, 2011 at 9:03 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Some comments below. On Jan 29, 2011, at 5:55am, Julien Nioche wrote: Hi, This shows the state of the various threads within a Java process. Most of them seem to be busy parsing zip archives with Tika. The interesting part is that the main thread is at the Generation step: * at org.apache.nutch.crawl.Generator.generate(Generator.java:431) at org.apache.nutch.crawl.Crawl.main(Crawl.java:127) * with the Thread-415331 normalizing the URLs as part of the generation. So why do we see threads busy parsing these archives? I think this is a result of the Timeout mechanism (https://issues.apache.org/jira/browse/NUTCH-696) used for the parsing. Before it, we used to have the parsing step loop on a single document and never complete. Thanks to Andrzej's patch, the parsing is done in separate threads which are abandoned if more than X seconds have passed (default 30, I think). Obviously these threads are still lurking around in the background and consuming CPU. This is an issue when calling the Crawl command only. When using the separate commands for the various steps, the runaway threads die with the main process; however, since the Crawl uses a single process, these timeout threads keep going. I am not an expert in multithreading and don't have an idea of whether these threads could be killed somehow. Andrzej, any clue? This is a fundamental problem with runaway threads - there is no safe, reliable way to kill them off. And if you parse enough documents, you will run into a number that currently cause Tika to hang. Zip files for sure, but we ran into the same issue with FLV files. Over in Tika-land, Jukka has a patch that fires up a child JVM and runs parsers there. See https://issues.apache.org/jira/browse/TIKA-416 -- Ken All, Just an observation, but the general approach to this problem is to use Thread.interrupt(). Virtually all code in the JDK treats the thread being interrupted as a request to cancel. Java Concurrency in Practice (JCIP) has a whole chapter on this topic (Chapter 7).
IMHO, any general purpose library code that swallows InterruptedException and isn't implementing the thread cancellation policy has a bug in it (the cancellation policy can only be implemented by the owner of the thread; unless the library is a task/thread library it cannot be implementing the cancellation policy). Any place you see: [snip] One exception is that socket read/write operations don't operate this way; the socket must be closed to interrupt a read/write. The approach JCIP suggests is to tie the socket and thread together in such a way that interrupt() closes the sockets that would be reading/writing inside that thread. Excellent input, as I need to solve some issues with needing to abort HTTP requests. [snip] Not sure exactly what the problems inside of Tika are, but getting it to respect interruption would be a wonderful thing for everybody that uses it. The problem might be getting all the underlying libraries it uses to do so. Yes, that's exactly the issue in the cases I've seen. The libraries used to do the actual parsing can get caught in loops when processing unexpected data. There are no checks for interrupt, e.g. it's code that is walking some data structure and doesn't realize that it's in a loop (e.g. the offset to the next chunk is set to zero, so the same chunk is endlessly
skip Urls regex
Hello, I see in the nutch-1.2/conf/regex-urlfilter.txt file the following lines: # skip URLs with slash-delimited segment that repeats 3+ times, to break loops -.*(/[^/]+)/[^/]+\1/[^/]+\1/ However, nutch fetches urls like http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/ Thanks. Alex.
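If it helps to narrow this down, the active filter chain can be tested directly against that URL from stdin; a sketch for Nutch 1.2 (assuming urlfilter-regex is in plugin.includes):

  echo 'http://www.example.com/text/dev/faq/dev/content/2305/dev/content/246/' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined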
Re: Problem with Gora dependencies in trunk
Hi, If you download gora and build it with ant, you get rid of one of the dependencies --unresolved dependency: org.apache.gora#gora-core;0.1: not found if you change the gora version from 1.0 to 1.0-incubator in one of the ivy files, but this one --unresolved dependency: org.apache.gora#gora-sql;0.1: not found stays, because gora itself does not build successfully. It also has some other dependencies that I have been unable to locate yet. Good luck, Alex. -Original Message- From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk To: user user@nutch.apache.org Sent: Thu, Mar 17, 2011 3:12 pm Subject: Problem with Gora dependencies in trunk Hi list, OK I have seen quite a few threads on this topic as well as a couple of comments appended to the blog entries provided on the wiki. I also posted on this a while back but unfortunately got no reply, so I thought the best thing to do was persist and see if I could solve the issue... how wrong I was. I have followed in minute detail the building nutch 2.0 in eclipse blog entry. I'm getting the following after attempting to add the Ivy library to ivy/ivy.xml Impossible to resolve dependencies of org.apache.nutch#${ant.project.name};working@lewis-01 unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found I read here http://www.mail-archive.com/user@nutch.apache.org/msg01515.html that there WAS a problem with Nutch wrongly assuming Gora artifacts, but that it has since been resolved, so I am really stumped. Any comments would be appreciated. Thank you Lewis
Re: Problem with Gora dependencies in trunk
Hi, Did you build gora with ant? I checked it out from svn a few days ago and ant for gora gives the error :: [ivy:resolve] :: UNRESOLVED DEPENDENCIES :: [ivy:resolve] :: [ivy:resolve] :: com.sun.jersey#jersey-core;1.4: not found [ivy:resolve] :: com.sun.jersey#jersey-json;1.4: not found [ivy:resolve] :: com.sun.jersey#jersey-server;1.4: not found Thanks. Alex. -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: user user@nutch.apache.org Cc: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk Sent: Thu, Mar 17, 2011 3:21 pm Subject: Re: Problem with Gora dependencies in trunk I had issues as well some while ago, but I updated to the latest trunk revision a few weeks ago. I first built Gora's checkout and after that ant was doing well with Nutch. No need to change Ivy anymore. Hi list, OK I have seen quite a few threads on this topic as well as a couple of comments appended to the blog entries provided on the wiki. I also posted on this a while back but unfortunately got no reply, so I thought the best thing to do was persist and see if I could solve the issue... how wrong I was. I have followed in minute detail the building nutch 2.0 in eclipse blog entry. I'm getting the following after attempting to add the Ivy library to ivy/ivy.xml Impossible to resolve dependencies of org.apache.nutch#${ant.project.name};working@lewis-01 unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found unresolved dependency: org.apache.gora#gora-core;0.1: not found unresolved dependency: org.apache.gora#gora-sql;0.1: not found I read here http://www.mail-archive.com/user@nutch.apache.org/msg01515.html that there WAS a problem with Nutch wrongly assuming Gora artifacts, but that it has since been resolved, so I am really stumped. Any comments would be appreciated. Thank you Lewis
Re: Script failing when arriving at 'Solr' commands
It seems to me that you may have the same problem as before with the disk space. This may happen because you do mergesegs. Try not to merge segments. Alex. -Original Message- From: McGibbney, Lewis John lewis.mcgibb...@gcu.ac.uk To: user user@nutch.apache.org Sent: Wed, Apr 6, 2011 12:55 pm Subject: Script failing when arriving at 'Solr' commands Hi list, The last week has been a real hang up and I have made very little progress, so excuse this lengthy post. Using branch-1.3. My script contains the following commands: 1. inject 2. generate fetch parse updatedb 3. mergesegs 4. invertlinks 5. solrindex 6. solrdedup 7. solrclean 8. load new index The script runs fine until the solrindex stage and then gives this output: LinkDb: starting at 2011-04-06 20:25:40 LinkDb: linkdb: crawl/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/segments/20110406202533 LinkDb: merging with existing linkdb: crawl/linkdb LinkDb: finished at 2011-04-06 20:25:44, elapsed: 00:00:03 - SolrIndex (Step 5 of 8) - SolrIndexer: starting at 2011-04-06 20:25:45 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_fetch Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/crawl_parse Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_data Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/linkdb/parse_text Input path does not exist: file:/home/lewis/workspace/branch-1.3/runtime/local/crawl/NEWindexes/current - SolrDedup (Step 6 of 8) - Usage: SolrDeleteDuplicates solr url - SolrClean (Step 7 of 8) - SolrClean: starting at 2011-04-06 20:25:47 Exception in thread main java.io.IOException: No FileSystem for scheme: http at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at org.apache.nutch.indexer.solr.SolrClean.delete(SolrClean.java:168) at org.apache.nutch.indexer.solr.SolrClean.run(SolrClean.java:180) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.solr.SolrClean.main(SolrClean.java:186) Having inspected the linkdb I can see a directory named 'current', which in turn contains a 'part-0' directory which contains two files named 'data' and 'index'... as far as I am aware this is identical to when I used Nutch-1.2. A couple of points to note about my recent discoveries and thoughts: 1. I was having problems with the script with a similar 'Input path does not exist' error until I added the hadoop.tmp.dir property as a HDD partition to nutch-site; this seemed to solve the problem. 2.
I am aware that it is maybe not necessary (and possibly not best practice in some situations) to include an invertlinks command prior to indexing; however, this has always been my practice and has always provided great results when I was using the legacy Lucene indexing within Nutch-1.2, therefore I am curious to understand if it is this command which is knocking off the solrindexer. 3. Is it a possibility that there is a similar property, such as solr.tmp.dir, which I need to set and am missing, and this is knocking the solrindexer off? 4. Even after the solrindexer kicks in, the solrdedup output does not appear to be responding correctly; this is shadowed by solrclean, so I am definitely doing something wrong here; however, I am unfamiliar with the IOException No FileSystem for scheme: http. I understand that this post may seem a bit epic, but from the information I have, e.g. logs, terminal output and user lists, I am stumped. I'm therefore looking for guys with more experience to possibly lend a hand. I can provide additional command parameters if this is of value. Thanks in advance for any help Lewis
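For comparison, the indexing commands in branch-1.3 expect roughly the following argument shapes; the Solr URL and paths are placeholders, and the exact usage strings are worth re-checking by running each command without arguments. The Input path does not exist: .../linkdb/crawl_fetch lines above look as if a linkdb path is being read where a segment is expected, and the No FileSystem for scheme: http trace looks as if the Solr URL is being read as an input path, so swapped or missing arguments in the script are one possible explanation (an assumption, not a confirmed diagnosis):

  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110406202533
  bin/nutch solrdedup http://localhost:8983/solr/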
Re: will nutch-2 be able to index image files
Hello, It looks like I will have some spare time in the next month, so I may work on writing this image indexing plugin. I wondered if there is a similar plugin to leverage code from or follow? Thanks. Alex. -Original Message- From: Andrzej Bialecki a...@getopt.org To: user user@nutch.apache.org Sent: Wed, Mar 9, 2011 12:24 am Subject: Re: will nutch-2 be able to index image files On 3/8/11 10:50 PM, alx...@aim.com wrote: I meant to extract the image title, src link and alt from img tags, and not store the image files. For a keyword search it must display the link, which automatically displays the image itself in the search page. I am not sure what you mean by image content-based retrieval. Do image files have tags like mp3 ones? Yes, for example http://en.wikipedia.org/wiki/Exchangeable_image_file_format Must a parse plugin be written in both cases? Yes - most data is already available either in the DOM tree, or can be obtained from a Tika image parser; it just needs to be wrapped in a plugin. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
Re: Hosts File Nutch 1.0+
It seems you should move www.example.com example.com from line 3 to line 1, uncomment line 3 and comment out the other lines. Alex. -Original Message- From: Alex alex.thegr...@ambix.net To: user user@nutch.apache.org Sent: Tue, Apr 26, 2011 4:18 am Subject: Re: Hosts File Nutch 1.0+ Just in case someone has more ideas, here is how my hosts file looks: http://pastebin.com/wyV7wnqn Any help is highly appreciated! Alex On Apr 25, 2011, at 10:13 PM, Alex wrote: Dear Mark: Thank you so much for the help! I tried it but it still gives me the same error. According to the developer it is either a server environment issue (the server not being able to search itself) or a hosts file issue. Any other ideas? Thank you so much for your time! Alex On Apr 19, 2011, at 6:01 PM, Mark Achee wrote: With nslookup already showing the correct IP address, it doesn't seem like a hostname/DNS issue. But I assume this is what the developer is talking about: At the end of your /etc/hosts file add 127.0.0.1 www.example.org but replace www.example.org with your domain. If you know what the server's other IP address(es) is/are, you could try those also instead of 127.0.0.1. If that doesn't fix it, it's probably not really a hostname/DNS issue. -Mark On Tue, Apr 19, 2011 at 6:47 PM, Alex alex.thegr...@ambix.net wrote: I edited that so that it does not disclose the location of my rootUrlDir. The path is accurate. I am going to find out what command is given to nutch, but basically the application developer has confirmed that the issue is the hosts file or something on the server that can not search itself. Alex On Apr 19, 2011, at 5:22 PM, Mark Achee wrote: From your logs: INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/ Looks like you didn't set the seed urls directory. If that's not enough info for you to fix it, send the full command you're running. -Mark On Thu, Apr 14, 2011 at 10:57 PM, Alex alex.thegr...@ambix.net wrote: Hi, I am new to Nutch. I have an application that uses Nutch to search. I have configured the application so that Nutch can run. However, after a lot of troubleshooting I have been pointed to the fact that there is something wrong with my hosts file. My hostname is different than my domain name and that seems to make Nutch stop at depth 1. Does anyone have any idea of what is the correct configuration of the hosts file so that nutch runs properly? My domain name resolves fine. Please help me! Here are the logs of the indexing: Stopping at depth=1 - no more URLs to fetch. INFO sitesearch.CrawlerUtil: indexHost : Starting an Site Search index on host www.mydomain.com INFO sitesearch.CrawlerUtil: site search crawl started in: /opt/dotcms/dotCMS/assets/search_index/www.mydomain.com/1-XXX_temp/crawl-index ] INFO sitesearch.CrawlerUtil: rootUrlDir = /path/to/directory/search_index/www.mydomain.com/url_folder INFO sitesearch.CrawlerUtil: threads = 10 INFO sitesearch.CrawlerUtil: depth = 20 INFO sitesearch.CrawlerUtil: indexer=lucene INFO sitesearch.CrawlerUtil: Stopping at depth=1 - no more URLs to fetch. INFO sitesearch.CrawlerUtil: site search crawl finished: /directorypath/search_index/www.mydomain.com/1xxx/crawl-index INFO sitesearch.CrawlerUtil: indexHost : Finished Site Search index on host www.mydomain.com
keeping index up to date
Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex.
Re: keeping index up to date
Hi, I took a look at the recrawl script and noticed that all the steps except url injection are repeated on each subsequent indexing, and wondered why we would generate new segments. Is it possible to do the fetch and update steps for all the previous $s1..$sn segments, then invertlinks and index? Thanks. Alex. -Original Message- From: Julien Nioche lists.digitalpeb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jun 1, 2011 12:59 am Subject: Re: keeping index up to date You should use the adaptive fetch schedule. See http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for details On 1 June 2011 07:18, alx...@aim.com wrote: Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com
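The adaptive schedule Julien points to is enabled with a couple of nutch-site.xml properties; a sketch with illustrative values (the interval numbers are placeholders, not recommendations):

  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>2592000</value> <!-- 30 days; pages that keep coming back unchanged get pushed out further by the adaptive schedule -->
  </property>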
ranking of search results
Hello, I use nutch 1.2 and solr to index about 3500 domains. I noticed that search results for two or more keywords are not ranked properly. For example, for the keyword Lady Gaga, some results that have only Lady are displayed first, then some results with both keywords, etc. It seems to me that results with both words should be displayed in first place and those with only one of the keywords should follow them. Any idea how to correct this? Thanks. Alex.
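One knob that usually matters here is whether Solr treats multiple query terms as OR or AND; a sketch of two common adjustments for Solr 1.4 (field names follow the standard Nutch schema, and whether this fully fixes the ranking is an assumption):

  <!-- schema.xml: require all terms for the default (lucene) query parser -->
  <solrQueryParser defaultOperator="AND"/>

  <!-- or, with the dismax handler, require both terms and boost title/url matches -->
  http://localhost:8983/solr/select?defType=dismax&q=Lady+Gaga&mm=2&qf=title^2+url^2+content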
Re: keeping index up to date
Hello, One more question. Is there a way of adding new urls to a crawldb created in previous crawls, to include them in subsequent recrawls? Thanks. Alex. -Original Message- From: lewis john mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org; markus.jelsma markus.jel...@openindex.io Sent: Tue, Jun 7, 2011 1:16 pm Subject: Re: keeping index up to date Hi, To add to Markus' comments, if you take a look at the script it is written in such a way that if run in safe mode it protects us against an error which may occur. If this is the case we can recover segments etc. and take appropriate action to resolve it. On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, I took a look at the recrawl script and noticed that all the steps except url injection are repeated on each subsequent indexing, and wondered why we would generate new segments. Is it possible to do the fetch and update steps for all the previous $s1..$sn segments, then invertlinks and index? No, the generator generates a segment with a list of URLs for the fetcher to fetch. You can, if you like, then merge segments. Thanks. Alex. -Original Message- From: Julien Nioche lists.digitalpeb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jun 1, 2011 12:59 am Subject: Re: keeping index up to date You should use the adaptive fetch schedule. See http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for details On 1 June 2011 07:18, alx...@aim.com wrote: Hello, I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them. Thanks. Alex. -- *Lewis*
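Re-running inject against the existing db is the usual way to do this, since newly injected URLs are merged into the crawldb and picked up by later generate rounds; a sketch (the seed directory name is a placeholder):

  bin/nutch inject crawl/crawldb new_urls/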
Re: solrindex command` not working
Check for errors in the solr log. -Original Message- From: Way Cool way1.wayc...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 26, 2011 3:14 pm Subject: Re: solrindex command` not working The latest solr version is 3.3. Maybe you can try that. On Tue, Jul 26, 2011 at 2:10 AM, Marseld Dedgjonaj marseld.dedgjo...@ikubinfo.com wrote: Hello list, I am having trouble using nutch + solr in the index step. I use nutch 1.2 and solr 1.3. When I execute the command: $NUTCH_HOME/bin/nutch solrindex http://127.0.0.1:8983/solr/main/ $crawldir/crawldb $crawldir/linkdb $crawldir/segments/* I get 2011-07-25 20:11:59,702 ERROR solr.SolrIndexer - java.io.IOException: Job failed! and no index is passed to Solr. Any idea what I am doing wrong? Thanks in advance, Marseld
ranking in nutch/solr results
Hello, I use nutch-1.2 with solr 1.4. Recently, I noticed that in a search for a domain name, for example yahoo.com, yahoo.com is not in first place. Instead, other sites that have yahoo.com in their content are in the first places. I tested this with Google; in its results the domain is in first place. Any idea how to fix this in the Nutch/Solr results? Thanks in advance. Alex.
Re: nutch redirect treatment
https://issues.apache.org/jira/browse/NUTCH-1044 -Original Message- From: abhayd ajdabhol...@hotmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Wed, Aug 17, 2011 11:44 am Subject: nutch redirect treatment hi I have seen similar posts in this forum but am still not able to understand how redirects are handled. I am trying to crawl http://developer.att.com/developer/ . After a successful crawl I dump the crawldb using readdb. I see entries like the following. What does this mean? Has nutch crawled the redirected page and is it in the index? I tried using the readseg command with all the segments under the crawl/segments directory but I could not find the http://developer.att.com/developer/tier1page.jsp?passedItemId=16_requestid=35037 url. Here is my crawl/segments directory listing: 20110817001833 20110817002117 20110817003028 20110817003930 20110817004202 20110817001844 20110817002556 20110817003532 20110817004105 Any idea why the redirected page is not crawled? http://developer.att.com/developer/ Version: 7 Status: 4 (db_redir_temp) Fetch time: Fri Sep 16 00:18:36 CDT 2011 Modified time: Wed Dec 31 18:00:00 CST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: _pst_: temp_moved(13), lastModified=0: http://developer.att.com/developer/tier1page.jsp?passedItemId=16_requestid=35037 http://developer.att.com/developer/16 Version: 7 Status: 5 (db_redir_perm) Fetch time: Fri Sep 16 00:43:33 CDT 2011 Modified time: Wed Dec 31 18:00:00 CST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 0.0 Signature: null Metadata: _pst_: moved(12), lastModified=0: http://developer.att.com/developer/forward.jsp?passedItemId=16
Re: nutch redirect treatment
As far as I understand, redirected urls are scored 0 and that is why the fetcher does not pick them up in the earlier depths. They may be crawled starting at depth 4, depending on the size of the seed list. -Original Message- From: abhayd ajdabhol...@hotmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Wed, Aug 17, 2011 4:41 pm Subject: Re: nutch redirect treatment thanks for the response. But my issue is that after the redirect the new url is not being crawled. Not a scoring issue.
Re: fetcher runs without error with no internet connection
Hi Lewis, I stopped the fetcher and started it on the same segment again. But before doing that I turned off the modem, and the fetcher started giving UnknownHost exceptions. It was not giving any errors during the DSL failure, i.e. when I was not able to connect to any sites. Again, this is nutch-1.2. Thanks. Alex. -Original Message- From: lewis john mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Tue, Aug 23, 2011 6:37 am Subject: Re: fetcher runs without error with no internet connection Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch $segment for 2 days, the internet connection was lost, but nutch did not give any errors. Usually I was seeing Unknown host exceptions before. Any ideas what happened, and is it OK to stop the fetch and run it again on the same (old) segment? This is nutch-1.2. Thanks. Alex. -- *Lewis*
Re: fetcher runs without error with no internet connection
It is a DNS problem, because it was giving a lot of UnknownHost exceptions. I decreased the thread number to 5, but the DSL still fails periodically. I wondered what a typical internet connection is for fetching about 3500 domains. I currently have DSL at 3 Mbps. Thanks. Alex. -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: user user@nutch.apache.org Sent: Mon, Aug 29, 2011 5:19 pm Subject: Re: fetcher runs without error with no internet connection I didn't say you have a DNS problem, only that these exceptions may occur if the DNS can't keep up with the requests you make. Make sure you have a DNS problem before trying to solve a problem that doesn't exist. It's normal to have these exceptions once in a while. Solving DNS issues is beyond the scope of this list. You may, however, opt for some DNS caching in your network. What is the solution to the issue with the DNS server? -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: user user@nutch.apache.org Sent: Tue, Aug 23, 2011 12:32 pm Subject: Re: fetcher runs without error with no internet connection If you fetch too hard, your DNS server may not be able to keep up. Hi Lewis, I stopped the fetcher and started it on the same segment again. But before doing that I turned off the modem, and the fetcher started giving UnknownHost exceptions. It was not giving any errors during the DSL failure, i.e. when I was not able to connect to any sites. Again, this is nutch-1.2. Thanks. Alex. -Original Message- From: lewis john mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Tue, Aug 23, 2011 6:37 am Subject: Re: fetcher runs without error with no internet connection Hi Alex, Did you get anywhere with this? What condition led to you seeing the unknown host exception? Unless the segment gets corrupted, I would assume you could fetch again. Hopefully you can confirm this. On Tue, Aug 16, 2011 at 9:23 PM, alx...@aim.com wrote: Hello, After running bin/nutch fetch $segment for 2 days, the internet connection was lost, but nutch did not give any errors. Usually I was seeing Unknown host exceptions before. Any ideas what happened, and is it OK to stop the fetch and run it again on the same (old) segment? This is nutch-1.2. Thanks. Alex.
spellchecking in nutch solr
Hello, I have tried to implement a spellchecker based on the index in nutch-solr by adding a spell field to schema.xml and making it a copy of the content field. However, this doubled the data folder size, and the spell field, as a copy of the content field, appears in the xml feed, which is not necessary. Is it possible to implement the spellchecker without this issue? Thanks. Alex.
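One common way to keep the copy field out of the responses, and to avoid storing a second copy of the text, is to make the copy target unstored; the spellcheck component only needs the indexed terms. A sketch for schema.xml (the field and type names are assumptions based on the post, and the indexed terms still take some space):

  <field name="spell" type="textSpell" indexed="true" stored="false"/>
  <copyField source="content" dest="spell"/>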
Re: Crawl fails - Input path does not exist
Comparing with nutch-1.2, I do not see any content folders under the segment directories. Does this mean that we cannot set store.content to false in nutch-1.3? Thanks. Alex.
Re: more from link
I see what is done in nutch results: results are grouped with 1 doc in each group. I need to group with a max of 3 docs in each group. In Solr, it is impossible to paginate when grouping with more than 1 doc in each group. Google can do it with 5 docs in the first group, as I see. Thanks. Alex. -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: user user@nutch.apache.org Sent: Wed, Sep 14, 2011 2:24 am Subject: Re: more from link Field collapse on the site or host field. Hello, In the nutch search page there is a more from link in the case when there are many results from the same site. Is there a way to have this kind of link when Solr is used as the front end? Thanks. Alex.
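For reference, with a Solr version that has result grouping (it is not in a stock Solr 1.4), the request Markus describes would look roughly like this; the site field name and the per-group limit are assumptions:

  http://localhost:8983/solr/select?q=foo&group=true&group.field=site&group.limit=3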
restart a failed job
Hello, I wondered if it is possible to restart a failed job in nutch-1.3. I have this error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/ after fetching for 5 days. I know the reason for the error, but do not want to restart the whole process from the beginning. I use nutch in local mode on one machine. Thanks. Alex.
fetch command does not parse
Hello, I tried the fetch command with the following config:

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content. Default is false, which means that a separate parsing step is required after fetching is finished.</description>
</property>

However, the fetcher did not parse. There are no parse folders under the segment and updatedb gives errors. I wonder how to crawl without storing content in this version 1.3? Thanks. Alex.
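If the fetcher really skipped parsing, the separate parse step in 1.3 can normally be run afterwards on the same segment before updatedb; a sketch (the crawl directory layout is a placeholder):

  segment=`ls -d crawl/segments/2* | tail -1`
  bin/nutch parse $segment
  bin/nutch updatedb crawl/crawldb $segment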
Re: Removing urls from crawl db
I think you must add a regex to regex-urlfilter.txt . In that case those urls will not be fetched by fetcher. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Nov 1, 2011 10:35 am Subject: Re: Removing urls from crawl db Already did that. But it doesn't allow me to delete urls from the list to be crawled. On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema ferdy.gal...@kalooga.comwrote: As for reading the crawldb, you can use org.apache.nutch.crawl.**CrawlDbReader. This allows for dumping the crawldb into a readable textfile as well as querying individual urls. Run without args to see its usage. On 10/31/2011 08:47 PM, Markus Jelsma wrote: Hi Write an regex URL filter and use it the next time you update the db; it will disappear. Be sure to backup the db first in case your regex catches valid URL's. Nutch 1.5 will have an option to keep the previous version of the DB after update. cheers We accidentally injected some urls into the crawl database and I need to go remove them. From what I understand, in 1.4 I can view and modify the urls and indexes. But I can't seem to find any information on how to do this. Is there anything regarding this available?
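As a concrete sketch of the two suggestions in this thread (the paths and the pattern are placeholders): dump the crawldb with CrawlDbReader to see what is in it, then add a reject rule to conf/regex-urlfilter.txt so the unwanted urls are dropped at the next updatedb/generate:

bin/nutch readdb crawl/crawldb -dump crawldb-dump

# in conf/regex-urlfilter.txt, before the catch-all accept rule:
-^http://www\.example\.com/accidentally-injected/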
Re: how use NUTCH-16 in my nutch 1.3?
I think this patch is already included in the current version. -Original Message- From: mina tahereganji...@gmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Wed, Nov 2, 2011 7:08 pm Subject: how use NUTCH-16 in my nutch 1.3? I want to use NUTCH-61 from http://issues.apache.org/jira/browse/NUTCH-61 https://issues.apache.org/jira/browse/NUTCH-61 but I don't know how to use it in my nutch 1.3. Help me. -- View this message in context: http://lucene.472066.n3.nabble.com/how-use-NUTCH-16-in-my-nutch-1-3-tp3473096p3473096.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Fetching just some urls outside domain
Hello, It is interesting to know how can one put a filter on outlinks? I mean if I have a regex, in which file should I put it? For example, I want nutch to ignore outlinks ending with .info. Thanks. Alex. -Original Message- From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au To: user user@nutch.apache.org Sent: Thu, Dec 1, 2011 1:44 pm Subject: RE: Fetching just some urls outside domain Hi Adriana, You can try Arch for this: http://www.atnf.csiro.au/computing/software/arch You can configure it to crawl your web sites plus sets of miscellaneous URLs called bookmarks in Arch. Arch is a free extension of Nutch. Right now, only Arch based on Nutch 1.2 is available for downloading. We are about to release Arch based on Nutch 1.4. Regards, Arkadi -Original Message- From: Adriana Farina [mailto:adriana.farin...@gmail.com] Sent: Thursday, 1 December 2011 7:58 PM To: user@nutch.apache.org Subject: Re: Fetching just some urls outside domain Hi! Thank you for your answer. You're right, maybe an example would explain better what I need to do. I have to perform the following task. I have to explore a specific domain (. gov.it) and I have an initial set of seeds, for example www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch pages outside that domain. However some resources I need to download (documents) are stored on web sites that are not inside the domain I'm interested in. For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where www.somesite.it is not inside my domain). Nutch will not fetch that page since I told it to behave that way, but I need to download documents stored on www.somesite.it. So I need nutch to go outside the domain I specified only when it sees the words albi or albo inside the url, since that words identify the documents I need. How can I do this? I hope I've been clear. :) 2011/11/30 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi Adriana, This should be achievable through fine grained URL filters. It is kindof hard to substantiate on this without you providing some examples of the type of stuff you're trying to do! Lewis On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina adriana.farin...@gmail.com wrote: Hello, I'm using nutch 1.3 from just a month, so I'm not an expert. I configured it so that it doesn't fetch pages outside a specific domain. However now I need to let it fetch pages outside the domain I choosed but only for some urls (not for all the urls I have to crawl). How can I do this? I have to write a new plugin? Thanks. -- *Lewis*
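To make the question concrete: with the stock urlfilter-regex plugin enabled, an entry along these lines in conf/regex-urlfilter.txt (a sketch, untested against the poster's setup) makes the parse/updatedb/generate steps drop outlinks whose host ends in .info:

-^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.info(/|$)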
Re: Fetching just some urls outside domain
If I understand you correctly, you state that even if my question is related to the current thread, nevertheless I must open a new one? -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Thu, Dec 1, 2011 3:01 pm Subject: Re: Fetching just some urls outside domain Nutch comes packed with quite a few url-filters out of the box. They just need some tuning. Have a look in NUTCH_HOME/conf Also have a look at the corresponding plugins. Realistically you should really start a new thread for new questions :0) I think you're looking for the urlfilter-domain plugin On Thu, Dec 1, 2011 at 10:48 PM, alx...@aim.com wrote: Hello, It is interesting to know how can one put a filter on outlinks? I mean if I have a regex, in which file should I put it? For example, I want nutch to ignore outlinks ending with .info. Thanks. Alex. -Original Message- From: Arkadi.Kosmynin arkadi.kosmy...@csiro.au To: user user@nutch.apache.org Sent: Thu, Dec 1, 2011 1:44 pm Subject: RE: Fetching just some urls outside domain Hi Adriana, You can try Arch for this: http://www.atnf.csiro.au/computing/software/arch You can configure it to crawl your web sites plus sets of miscellaneous URLs called bookmarks in Arch. Arch is a free extension of Nutch. Right now, only Arch based on Nutch 1.2 is available for downloading. We are about to release Arch based on Nutch 1.4. Regards, Arkadi -Original Message- From: Adriana Farina [mailto:adriana.farin...@gmail.com] Sent: Thursday, 1 December 2011 7:58 PM To: user@nutch.apache.org Subject: Re: Fetching just some urls outside domain Hi! Thank you for your answer. You're right, maybe an example would explain better what I need to do. I have to perform the following task. I have to explore a specific domain (. gov.it) and I have an initial set of seeds, for example www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch pages outside that domain. However some resources I need to download (documents) are stored on web sites that are not inside the domain I'm interested in. For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where www.somesite.it is not inside my domain). Nutch will not fetch that page since I told it to behave that way, but I need to download documents stored on www.somesite.it. So I need nutch to go outside the domain I specified only when it sees the words albi or albo inside the url, since that words identify the documents I need. How can I do this? I hope I've been clear. :) 2011/11/30 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi Adriana, This should be achievable through fine grained URL filters. It is kindof hard to substantiate on this without you providing some examples of the type of stuff you're trying to do! Lewis On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina adriana.farin...@gmail.com wrote: Hello, I'm using nutch 1.3 from just a month, so I'm not an expert. I configured it so that it doesn't fetch pages outside a specific domain. However now I need to let it fetch pages outside the domain I choosed but only for some urls (not for all the urls I have to crawl). How can I do this? I have to write a new plugin? Thanks. -- *Lewis* -- *Lewis*
Re: how give several sites to nutch to crawl?
I think you should add this to nutch-site.xml:

<property>
  <name>generate.max.count</name>
  <value>1000</value>
  <description>The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generator.count.mode.</description>
</property>

and set topN to -1. Alex. -Original Message- From: mina tahereganji...@gmail.com To: nutch-user nutch-u...@lucene.apache.org Sent: Sat, Dec 3, 2011 6:10 pm Subject: Re: how give several sites to nutch to crawl? thanks for your answer. i use this script to crawl my sites:

$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/seedUrls
for((i=0; i < $depth; i++))
do
  echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/segments $topN
  if [ $? -ne 0 ]
  then
    echo deepcrawler: Stopping at depth $depth. No more URLs to fetch.
    break
  fi
  segment1=`ls -d $NUTCH_HOME/bin/crawl1/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment1
  if [ $? -ne 0 ]
  then
    echo deepcrawler: fetch $segment1 at depth `expr $i + 1` failed.
    echo deepcrawler: Deleting segment $segment1.
    rm $RMARGS $segment1
    continue
  fi
  $NUTCH_HOME/bin/nutch parse $segment1
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/bin/crawl1/crawldb $segment1
done
echo - Merge Segments (Step 5 of $steps) -
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/bin/crawl1/MERGEDsegments $NUTCH_HOME/bin/crawl1/segments/*
if [ $safe != yes ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/segments
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPsegments
  mv $MVARGS $NUTCH_HOME/bin/crawl1/segments $NUTCH_HOME/bin/crawl1/BACKUPsegments
fi
mv $MVARGS $NUTCH_HOME/bin/crawl1/MERGEDsegments $NUTCH_HOME/bin/crawl1/segments
echo - Invert Links (Step 6 of $steps) -
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/bin/crawl1/linkdb $NUTCH_HOME/bin/crawl1/segments/*
if [ $safe != yes ]
then
  rm $RMARGS $NUTCH_HOME/bin/crawl1/NEWindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/index
else
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindexes
  rm $RMARGS $NUTCH_HOME/bin/crawl1/BACKUPindex
  mv $MVARGS $NUTCH_HOME/bin/crawl1/NEWindexes $NUTCH_HOME/bin/crawl1/BACKUPindexes
  mv $MVARGS $NUTCH_HOME/bin/crawl1/index $NUTCH_HOME/bin/crawl1/BACKUPindex
fi
$NUTCH_HOME/bin/nutch solrindex http://$HOST:8983/solr/ $NUTCH_HOME/bin/crawl1/crawldb $NUTCH_HOME/bin/crawl1/linkdb $NUTCH_HOME/bin/crawl1/segments/*

but nutch don't crawl all page in any site, for example when topN=1000, nutch crawl 700 page from site1 and 250 from site2 and 40 from site3 and 10 page from site4. i want nutch crawl 1000 page from any site.help me. -- View this message in context: http://lucene.472066.n3.nabble.com/how-give-several-sites-to-nutch-to-crawl-tp3556697p3558152.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Can't crawl a domain; can't figure out why.
It seems that robots.txt in libraries.mit.edu has a lot of restrictions. Alex. -Original Message- From: Chip Calhoun ccalh...@aip.org To: user user@nutch.apache.org; 'markus.jel...@openindex.io' markus.jel...@openindex.io Sent: Tue, Dec 20, 2011 7:28 am Subject: RE: Can't crawl a domain; can't figure out why. I just compared this against a similar crawl of a completely different domain which I know works, and you're right on both counts. The parser doesn't parse a file, and nothing is sent to the solrindexer. I tried a crawl with more documents and found that while I can get documents from mit.edu, I get absolutely nothing from libraries.mit.edu. I get the same effect using Nutch 1.3 as well. I don't think we're dealing with truncated files. I'm willing to believe it's a parse error, but how could I tell? I've spoken with some helpful people from MIT, and they don't see a reason why this wouldn't work. Chip -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, December 19, 2011 5:01 PM To: user@nutch.apache.org Subject: Re: Can't crawl a domain; can't figure out why. Nothing peculiar, looks like Nutch 1.4 right? But you also didn't mention the domain you can't crawl. libraries.mit.edu seems to work, although the indexer doesn't seem to send a document in and the parser doesn't mention parsing that file. Either the file throws a parse error or is truncated or I'm trying to crawl pages from a number of domains, and one of these domains has been giving me trouble. The really irritating thing is that it did work at least once, which led me to believe that I'd solved the problem. I can't think of anything at this point but to paste my log of a failed crawl and solrindex and hope that someone can think of anything I've overlooked. Does anything look strange here? Thanks, Chip 2011-12-19 16:31:01,010 WARN crawl.Crawl - solrUrl is not set, indexing will be skipped... 2011-12-19 16:31:01,404 INFO crawl.Crawl - crawl started in: mit-c-crawl 2011-12-19 16:31:01,420 INFO crawl.Crawl - rootUrlDir = mit-c-urls 2011-12-19 16:31:01,420 INFO crawl.Crawl - threads = 10 2011-12-19 16:31:01,420 INFO crawl.Crawl - depth = 1 2011-12-19 16:31:01,420 INFO crawl.Crawl - solrUrl=null 2011-12-19 16:31:01,420 INFO crawl.Crawl - topN = 50 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: starting at 2011-12-19 16:31:01 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: crawlDb: mit-c-crawl/crawldb 2011-12-19 16:31:01,420 INFO crawl.Injector - Injector: urlDir: mit-c-urls 2011-12-19 16:31:01,436 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries. 
2011-12-19 16:31:02,854 INFO plugin.PluginRepository - Plugins: looking in: C:\Apache\apache-nutch-1.4\runtime\local\plugins 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true] 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Plugins: 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Basic URL Normalizer (urlnormalizer-basic) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Html Parse Plug-in (parse-html) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Http / Https Protocol Plug-in (protocol-httpclient) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -HTTP Framework (lib-http) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Pass-through URL Normalizer (urlnormalizer-pass) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Http Protocol Plug-in (protocol-http) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Regex URL Normalizer (urlnormalizer-regex) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -Tika Parser Plug-in (parse-tika) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -CyberNeko HTML Parser (lib-nekohtml) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository -URL Meta Indexing Filter (urlmeta) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter) 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Registered Extension-Points: 2011-12-19 16:31:02,917 INFO plugin.PluginRepository - Nutch
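On the how could I tell question about parse errors: Nutch can fetch and parse a single url from the command line and print the outcome, which makes it easy to see whether libraries.mit.edu pages fail in the parser or never come back at all. A sketch (the short alias may differ between 1.3 and 1.4; the fully qualified class form works either way):

bin/nutch parsechecker http://libraries.mit.edu/
bin/nutch org.apache.nutch.parse.ParserChecker http://libraries.mit.edu/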
Re: Solrdedup fails due to date format
Hello, I took a look at the source of the SolrDeleteDuplicates class. The patch is already applied. Any ideas what might be wrong? I issue this command bin/nutch solrdedup http://127.0.0.1:8983/solr/ and the solr schema is the one that comes with nutch. Thanks in advance. Alex. -Original Message- From: Alexander Aristov alexander.aris...@gmail.com To: user user@nutch.apache.org Cc: nutch-user nutch-u...@lucene.apache.org Sent: Tue, Jan 31, 2012 9:34 pm Subject: Re: Solrdedup fails due to date format what is your solr schema configuration for nutch fields? Best Regards Alexander Aristov On 1 February 2012 09:26, alx...@aim.com wrote: Hello, I have tried solrdedup in nutch-1.3 and 1.4. Both give WARNING: Error reading a field from document : SolrDocument[{boost=5.38071E-4, digest=79e4d5033ef83223b17c56b7c7d853b3}] java.lang.NumberFormatException: For input string: There is a patch at https://issues.apache.org/jira/browse/NUTCH-986 and it is stated that it is a fix for 1.3. Any comment on this? Thanks. Alex.
Re: http.redirect.max
Hello, I tried 1, 2, -1 for the config http.redirect.max, but nutch still postpones redirected urls to later depths. What is the correct config setting to have nutch crawl redirected urls immediately. I need it because I have restriction on depth be at most 2. Thanks. Alex. -Original Message- From: xuyuanme xuyua...@gmail.com To: user user@nutch.apache.org Sent: Fri, Feb 24, 2012 1:31 am Subject: Re: http.redirect.max The config file is used for some proof of concept testing so the content might be confusing, please ignore some incorrect part. Yes from my end I can see the crawl for website http://www.scotland.gov.uk is redirected as expected. However the website I tried to crawl is a bit more tricky. Here's what I want to do: 1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B as the seed page 2. And try to crawl one of the link (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.OverviewDrugName=BACIGUENT) as a test If you click the link, you'll find the website use redirect and cookie to control page navigation. So I used protocol-httpclient plugin instead of protocol-http to handle the cookie. However, the redirect does not happen as expected. The only way I can fetch second link is to manually change response = getResponse(u, datum, *false*) call to response = getResponse(u, datum, *true*) in org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the lib-http plugin. So my issue is related to this specific site http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_BrowseDrugInitial=B lewis john mcgibbney wrote I've checked working with redirects and everything seems to work fine for me. The site I checked on http://www.scotland.gov.uk temp redirect to http://home.scotland.gov.uk/home Nutch gets this fine when I do some tweaking with nutch-site.xml redirects property -1 (just to demonstrate, I would usually not set it so) Lewis -- View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html Sent from the Nutch - User mailing list archive at Nabble.com.
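For reference, the property under discussion as it would sit in nutch-site.xml; with a value above zero the fetcher follows up to that many redirects within the same fetch instead of only recording the target for a later round (a sketch; whether protocol-httpclient then carries the session cookie is the separate problem described above):

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>Maximum number of redirects the fetcher follows immediately;
  0 or a negative value means redirect targets are only recorded for later fetching.</description>
</property>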
different fetch interval for each depth urls
Hello, I need to have different fetch intervals for the initial seed urls and the urls extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it seems it cannot be used to solve this issue. Thanks in advance. Alex.
Re: different fetch interval for each depth urls
I need to run this as a cron job, so I cannot make changes manually. My problem is indexing newspaper sites: I only want to fetch the new links that are added every day, not re-fetch pages that have already been fetched. Thanks. Alex. -Original Message- From: Markus Jelsma markus.jel...@openindex.io To: user user@nutch.apache.org Cc: nutch-user nutch-u...@lucene.apache.org Sent: Thu, Mar 1, 2012 10:30 pm Subject: Re: different fetch interval for each depth urls Well, you could set a new default fetch interval in your configuration after the first crawl cycle, but the depth information is lost if you continue crawling, so there is no real solution. What problem are you trying to solve anyway? On Fri, 2 Mar 2012 00:19:34 -0500 (EST), alx...@aim.com wrote: Hello, I need to have different fetch intervals for the initial seed urls and the urls extracted from them at depth 1. How can this be achieved? I tried the -adddays option of the generate command, but it seems it cannot be used to solve this issue. Thanks in advance. Alex. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
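One part of this can be expressed in configuration: raising the default fetch interval keeps already-fetched pages from coming due again during the daily cron cycles, while newly discovered links are still generated and fetched on their first pass. A sketch for nutch-site.xml (the value is an example, in seconds, roughly 90 days):

<property>
  <name>db.fetch.interval.default</name>
  <value>7776000</value>
</property>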
using less resources
Hello, As far as I understand, nutch recrawls urls when their fetch time has passed the current time, regardless of whether those urls were modified or not. Is there any initiative on restricting recrawls to only those urls whose modified time (MT) is greater than the old MT? In detail: if nutch has crawled a url with the next fetch time in 30 days, then on the second recrawl nutch should visit this url, retrieve its modified time, compare it with the modified time we have in the crawldb, and recrawl it only if the new MT is greater than the old one, otherwise skip it. Thanks. Alex.
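The closest built-in mechanism is the adaptive fetch schedule, which lengthens the interval for pages that keep coming back unmodified and shortens it for pages that change; a sketch for nutch-site.xml (this adapts intervals over time rather than checking the modified time before every fetch):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>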
nutch-2.0 updatedb and parse commands
Hello, It seems to me that all the options the updatedb command has in nutch 1.4 have been removed in nutch-2.0. I would like to know if this was done purposefully or whether they will be added later. Also, how can I create multiple docs using the parse command? It seems the parse command does not have sufficient arguments either. Thanks in advance. Alex.
Re: nutch-2.0 updatedb and parse commands
Hi Lewis, In 1.X version there are -noAdditions options to updatedb and -adddays option to generate commands. How something similar to them can be done in 2.X version? Here, http://wiki.apache.org/nutch/Nutch2Roadmap it is stated Modify code so that parser can generate multiple documents which is what 1.x does but not 2.0 It is my understanding that 1.X's parser does not create multiple documents, though. Then what is the meaning of the above statement? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jun 19, 2012 6:09 am Subject: Re: nutch-2.0 updatedb and parse commands Hi Alex, On Mon, Jun 18, 2012 at 8:11 PM, alx...@aim.com wrote: Hello, It seems to me that all options to updatedb command that nutch 1.4 has, have been removed in nutch-2.0. I would like to know if this was done purposefully or they will be added later? As you have noticed there are a number of differences between 1.X and 2.X. W.r.t the ones you highlight e.g. CLI, yes these are intended Also, how can I create multiple doc using parse command? Can you please elaborate slightly? It seem there is no sufficient arguments to parse command too. What would you like to see added? If you feel like adding functionality then please open a ticket and if possible submit a patch if you have time. I've been working with the parsing code and it works fine for me but I don't fully understand your comment so if you again could elaborate if would be excellent. Thanks Lewis Thanks in advance. Alex. -- Lewis
Re: using less resources
I was thinking of using the Last-Modified header, but it may be absent. In that case we could use the signature of urls at indexing time. I took a look at the code; it seems this is implemented but not working. I tested nutch-1.4 with a single url, and solrindexer always sends the same number of documents to solr although none of the urls has changed. Thanks. Alex. -- View this message in context: http://lucene.472066.n3.nabble.com/using-less-resources-tp3985537p3990625.html Sent from the Nutch - User mailing list archive at Nabble.com.
parse and solrindex in nutch-2.0
Hello, I have tested nutch-2.0 with hbase and mysql, trying to index only one url with depth 1. I tried to fetch an html tag value and parse it into the metadata column of the webpage object by adding a parse-tag plugin. I saw there is no metadata member variable in the Parse class, so I used the putToMetadata function from the WebPage class, and it turned out that this function overwrites values for the same key, i.e. it keeps only the last tag value if there are multiple tags. Next, bin/nutch solrindex http://127.0.0.1:8983/solr/ -all prints SolrIndexerJob: starting SolrIndexerJob: done. I did 1. bin/nutch inject 2. bin/nutch generate 3. bin/nutch fetch batchId 4. bin/nutch parse batchId 5. bin/nutch solrindex http://127.0.0.1:8983/solr/ -all There is no data added to the solr index for the url I tried to index. Besides these, nutch-2.0 keeps content in the content column of the webpage table even if I put this in the config:

<property>
  <name>fetcher.store.content</name>
  <value>false</value>
  <description>If true, fetcher will store content.</description>
</property>

Any ideas on what is done wrong or how to fix these issues are welcome. Thanks. Alex.
Re: parse and solrindex in nutch-2.0
Hi, Thank you for clarifications. Regarding the metadata, what would be a proper way of parsing end indexing multivalued tags in nutch-2.0 then? Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Wed, Jun 27, 2012 1:20 am Subject: Re: parse and solrindex in nutch-2.0 Hi, Correct. When using specific_batchid or -all you have to run the updaterjob first. (Because it checks the dbupdate mark to not be null). But a workaround is to simply run the indexer with -reindex. This will ignore the db update mark and tries to index every parsed row (at any time). About the metadata: It's a known limitation that there cannot be any duplicate keys. (I'm not aware of any progress regarding this). fetcher.store.content indeed does not seem to work. This is a bug. I created an issue for this: NUTCH-1411 Ferdy. On Tue, Jun 26, 2012 at 11:47 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: update (or whatever the actual name of the command is) after parsing? On 25 June 2012 22:35, alx...@aim.com wrote: Hello, I have tested nutch-2.0 with hbase and mysql trying to index only one url with depth 1. I tried to fetch an html tag value and parse it to metadata column in webpage object by adding parse-tag plugin. I saw there is no metadata member variable in Parse class, so I used putToMetadata function from Webpage class and it turned out that this function overwrites values for the same key, i.e, it keeps only the last tag value if there are multiple tags. Next bin/nutch solrindex http://127.0.0.1:8983/solr/ -all SolrIndexerJob: starting SolrIndexerJob: done. I did 1.bin/nutch inject 2.bin/nutch generate 3.bin/nutch fetch batchId 4.bin/nutch parse batchId 5.bin/nutch bin/nutch solrindex http://127.0.0.1:8983/solr/ -all There is no data added to solr index with the url I tried to index. Besides these, nutch-2.0 keeps content in the content column of webpage table if I put in the config property namefetcher.store.content/name valuefalse/value descriptionIf true, fetcher will store content./description /property Any ideas, what is done wrong or how to fix these issues are welcome. Thanks. Alex. -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
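Putting the answers together, a sequence that gets parsed rows into Solr in 2.0 looks roughly like this (a sketch; the batch id is whatever generate printed):

bin/nutch inject urls
bin/nutch generate
bin/nutch fetch <batchId>
bin/nutch parse <batchId>
bin/nutch updatedb
bin/nutch solrindex http://127.0.0.1:8983/solr/ -all

or, to skip the update-mark check entirely and index every parsed row:

bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex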
Re: parse and solrindex in nutch-2.0
Hi, I was planning to parse img tags from a url's content and put them in the metadata field of the WebPage storage class in nutch-2.0, to retrieve them later in the indexing step. However, since there is no metadata-type variable in the Parse class (compare with outlinks), this cannot be done in nutch 2.0 (compare the Parse class with the metadata-type variable in nutch 1.X). One is restricted to using the putToMetadata function of the WebPage class, which overwrites values, i.e., if I try to put two metadata entries img_alt:alt1 and img_alt:alt2, I get only the last value img_alt:alt2 in the metadata field. So, my question is how img tag alt values can be indexed in nutch-2.0, provided that there is more than one img tag in the crawled urls? Do I need to parse them and store them in one of the fields of the webpage storage class, or is this step not needed? Thanks. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 3, 2012 5:08 am Subject: Re: parse and solrindex in nutch-2.0 Hi, On Mon, Jul 2, 2012 at 8:21 PM, alx...@aim.com wrote: Regarding the metadata, what would be a proper way of parsing and indexing multivalued tags in nutch-2.0 then? Assuming you've taken a look into the schema, 'some' multivalued fields are permitted out of the box. Are you having problems obtaining multiple values for some fields within the documents you're trying to parse + index? Lewis
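As a workaround for putToMetadata overwriting the previous value, a parse plugin can append to the existing metadata entry and let an indexing filter split it again into a multivalued Solr field. A rough Java sketch, assuming the Gora-generated WebPage class exposes getFromMetadata alongside the putToMetadata already mentioned in this thread; the img_alt key and the tab separator are made-up names:

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import org.apache.avro.util.Utf8;
import org.apache.nutch.storage.WebPage;

public class ImgAltUtil {
  // append altText to the "img_alt" metadata entry instead of overwriting it
  public static void appendImgAlt(WebPage page, String altText) throws UnsupportedEncodingException {
    Utf8 key = new Utf8("img_alt");
    ByteBuffer old = page.getFromMetadata(key);
    String merged = altText;
    if (old != null) {
      merged = new String(old.array(), old.arrayOffset() + old.position(), old.remaining(), "UTF-8")
          + "\t" + altText;
    }
    page.putToMetadata(key, ByteBuffer.wrap(merged.getBytes("UTF-8")));
  }
}

At indexing time the single stored value can be split on the tab character and each piece added separately, so the Solr field only needs to be declared multiValued.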
Re: updatedb in nutch-2.0 with mysql
Not sure if I understood correctly. I did Counters c currentJob.getCounters(); System.out.println(c.toString()); With Mysql DbUpdaterJob: starting Counters: 20 DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=878298 FILE_BYTES_WRITTEN=992362 Map-Reduce Framework Combine input records=0 Combine output records=0 Total committed heap usage (bytes)=260177920 CPU time spent (ms)=0 Map input records=1 Map output bytes=193 Map output materialized bytes=202 Map output records=1 Physical memory (bytes) snapshot=0 Reduce input groups=1 Reduce input records=1 Reduce output records=1 Reduce shuffle bytes=0 Spilled Records=2 SPLIT_RAW_BYTES=962 Virtual memory (bytes) snapshot=0 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 DbUpdaterJob: done Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Wed, Jul 25, 2012 12:13 am Subject: Re: updatedb in nutch-2.0 with mysql Could you post the job counters? On Tue, Jul 24, 2012 at 8:14 PM, alx...@aim.com wrote: Hello, I am testing nutch-2.0 with mysql storage with 1 url. I see that updatedb command does not do anything. It does not add outlinks to the table as new urls and I do not see any error messages in hadoop.log Here is the log entries without plugin load info INFO crawl.DbUpdaterJob - DbUpdaterJob: starting 2012-07-24 10:53:46,142 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2012-07-24 10:53:46,979 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 1 2012-07-24 10:53:49,801 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1 2012-07-24 10:53:49,806 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - defaultInterval=2592 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - maxInterval=2592 2012-07-24 10:53:52,741 WARN mapred.FileOutputCommitter - Output path is null in cleanup 2012-07-24 10:53:53,584 INFO crawl.DbUpdaterJob - DbUpdaterJob: done Also, I noticed that there is crawlId option to it. Where its value comes from? Btw, updatedb with no arguments works fine if Hbase is chosen for storage. Thanks. Alex. ~
Re: updatedb in nutch-2.0 with mysql
I queried webpage table and there are a few links in outlinks column. As I noted in the original letter updatedb works with Hbase. This is the counters output in the case of Hbase. bin/nutch updatedb DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=879085 FILE_BYTES_WRITTEN=993668 Map-Reduce Framework Combine input records=0 Combine output records=0 Total committed heap usage (bytes)=341442560 CPU time spent (ms)=0 Map input records=1 Map output bytes=1421 Map output materialized bytes=1457 Map output records=14 Physical memory (bytes) snapshot=0 Reduce input groups=13 Reduce input records=14 Reduce output records=13 Reduce shuffle bytes=0 Spilled Records=28 SPLIT_RAW_BYTES=701 Virtual memory (bytes) snapshot=0 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 DbUpdaterJob: done I tried crawling http://www.yahoo.com . The same issue is present. Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Thu, Jul 26, 2012 6:26 am Subject: Re: updatedb in nutch-2.0 with mysql Yep I meant those counters. Looking at the code it seems just 1 record is passed around from mapper to reducer:This can only mean that no outlinks are outputted in the mapper. This might indicate that the url is not succesfully parsed. (Did you parse at all?) Are you able to peek in (or dump) your database with an external tool to see if outlinks are present before running the updater? Or perhaps check some parser log? On Wed, Jul 25, 2012 at 10:02 PM, alx...@aim.com wrote: Not sure if I understood correctly. I did Counters c currentJob.getCounters(); System.out.println(c.toString()); With Mysql DbUpdaterJob: starting Counters: 20 DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=878298 FILE_BYTES_WRITTEN=992362 Map-Reduce Framework Combine input records=0 Combine output records=0 Total committed heap usage (bytes)=260177920 CPU time spent (ms)=0 Map input records=1 Map output bytes=193 Map output materialized bytes=202 Map output records=1 Physical memory (bytes) snapshot=0 Reduce input groups=1 Reduce input records=1 Reduce output records=1 Reduce shuffle bytes=0 Spilled Records=2 SPLIT_RAW_BYTES=962 Virtual memory (bytes) snapshot=0 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 DbUpdaterJob: done Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Wed, Jul 25, 2012 12:13 am Subject: Re: updatedb in nutch-2.0 with mysql Could you post the job counters? On Tue, Jul 24, 2012 at 8:14 PM, alx...@aim.com wrote: Hello, I am testing nutch-2.0 with mysql storage with 1 url. I see that updatedb command does not do anything. It does not add outlinks to the table as new urls and I do not see any error messages in hadoop.log Here is the log entries without plugin load info INFO crawl.DbUpdaterJob - DbUpdaterJob: starting 2012-07-24 10:53:46,142 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 2012-07-24 10:53:46,979 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 1 2012-07-24 10:53:49,801 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 1 2012-07-24 10:53:49,806 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - defaultInterval=2592 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - maxInterval=2592 2012-07-24 10:53:52,741 WARN mapred.FileOutputCommitter - Output path is null in cleanup 2012-07-24 10:53:53,584 INFO crawl.DbUpdaterJob - DbUpdaterJob: done Also, I noticed that there is crawlId option to it. Where its value comes from? Btw, updatedb with no arguments works fine if Hbase is chosen for storage. Thanks. Alex. ~
Re: updatedb in nutch-2.0 with mysql
I tried your suggestion with sql server and everything works fine. The issue that I had was with mysql though. mysql Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1 After I have restarted mysql server and added to gora.properties mysql root user, updatdb adds outlinks as new urls, but as I noticed it did not remove values of prsmrk, gnmrk and ftcmrk as it happens in Hbase and as follows from code Mark.GENERATE_MARK.removeMarkIfExist(page);... in DbUpdateReducer.java I also see from time to time an error that text filed has size less than expected. It seems to me that nutch with mysql is still buggy, so I gave up using mysql with it in favor of Hbase. Thanks for your help. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Fri, Jul 27, 2012 2:03 am Subject: Re: updatedb in nutch-2.0 with mysql I've just ran a crawl with Nutch 2.0 tag using the SqlStore. Please try to reproduce from a clean checkout/download. nano conf/nutch-site.xml #set http.agent.name and http.robots.agents properties ant clean runtime java -cp runtime/local/lib/hsqldb-2.2.8.jar org.hsqldb.Server -database.0 mem:0 -dbname.0 nutchtest #start sql server #open another terminal cd runtime/local bin/nutch inject ~/urlfolderWithOneUrl/ bin/nutch generate bin/nutch fetch batchIdFromGenerate bin/nutch parse batchIdFromGenerate bin/nutch updatedb bin/nutch readdb -stats #this will show multiple entries bin/nutch readdb -dump out #this will dump a readable text file in folder out/ (with multiple entries) If this works as expected, it might be something with your sql server? (What server are you running exactly?) Ferdy. On Thu, Jul 26, 2012 at 8:15 PM, alx...@aim.com wrote: I queried webpage table and there are a few links in outlinks column. As I noted in the original letter updatedb works with Hbase. This is the counters output in the case of Hbase. bin/nutch updatedb DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=879085 FILE_BYTES_WRITTEN=993668 Map-Reduce Framework Combine input records=0 Combine output records=0 Total committed heap usage (bytes)=341442560 CPU time spent (ms)=0 Map input records=1 Map output bytes=1421 Map output materialized bytes=1457 Map output records=14 Physical memory (bytes) snapshot=0 Reduce input groups=13 Reduce input records=14 Reduce output records=13 Reduce shuffle bytes=0 Spilled Records=28 SPLIT_RAW_BYTES=701 Virtual memory (bytes) snapshot=0 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 DbUpdaterJob: done I tried crawling http://www.yahoo.com . The same issue is present. Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Thu, Jul 26, 2012 6:26 am Subject: Re: updatedb in nutch-2.0 with mysql Yep I meant those counters. Looking at the code it seems just 1 record is passed around from mapper to reducer:This can only mean that no outlinks are outputted in the mapper. This might indicate that the url is not succesfully parsed. (Did you parse at all?) Are you able to peek in (or dump) your database with an external tool to see if outlinks are present before running the updater? Or perhaps check some parser log? On Wed, Jul 25, 2012 at 10:02 PM, alx...@aim.com wrote: Not sure if I understood correctly. 
I did Counters c currentJob.getCounters(); System.out.println(c.toString()); With Mysql DbUpdaterJob: starting Counters: 20 DbUpdaterJob: starting counter name=Counters: 20 FileSystemCounters FILE_BYTES_READ=878298 FILE_BYTES_WRITTEN=992362 Map-Reduce Framework Combine input records=0 Combine output records=0 Total committed heap usage (bytes)=260177920 CPU time spent (ms)=0 Map input records=1 Map output bytes=193 Map output materialized bytes=202 Map output records=1 Physical memory (bytes) snapshot=0 Reduce input groups=1 Reduce input records=1 Reduce output records=1 Reduce shuffle bytes=0 Spilled Records=2 SPLIT_RAW_BYTES=962 Virtual memory (bytes) snapshot=0 File Input Format Counters Bytes Read=0 File Output Format Counters Bytes Written=0 DbUpdaterJob: done Thanks. Alex. -Original Message- From: Ferdy Galema
Re: Nutch 2.0 Solr 4.0 Alpha
Which storage do you use? Try solrindex with option -reindex. -Original Message- From: X3C TECH t...@x3chaos.com To: user user@nutch.apache.org Sent: Sun, Jul 29, 2012 10:58 am Subject: Re: Nutch 2.0 Solr 4.0 Alpha Forgot to do Specs VMWare Machine with CentOS 6.3 On Sun, Jul 29, 2012 at 1:53 PM, X3C TECH t...@x3chaos.com wrote: Hello, Has anyone been successful in hooking up Nutch 2 with Solr4? I seem to have my config screwed up somehow. I've added the Nutch fields to Solr's example schema and changed the field type from text' to text_general However when I index, I get the message SolrIndexerJob:starting SolrIndexerJob:Done but nothing has been indexed. Hadoop log shows no errors, neither does Solr terminal window. I even tried installing Solr 3.6.1 and copying the schema file as is, with no luck, same issue. Does something need to be adjusted in Nutch config? I made no adjustment when I built it, so it's stock beyond adjustments to hook up Hbase listed in tutorial. Your help is highly appreciated, as I'm really boggled by this!! Iggy
Re: Why won't my crawl ignore these urls?
Why do not you test your regex, to see if it really takes the urls you want to eliminate. It seems to me that your regex does not eliminate the type of urls you specified. Alex. -Original Message- From: Ian Piper ianpi...@tellura.co.uk To: user user@nutch.apache.org Sent: Mon, Jul 30, 2012 1:52 pm Subject: Re: Why won't my crawl ignore these urls? Hi again, Regarding disabling filters. I just checked in my nutch-default.xml and nutch-site.xml files. There is no reference to crawl.generate in either, which seems (http://wiki.apache.org/nutch/bin/nutch_generate) to suggest that urls should be filtered. Ian. -- On 30 Jul 2012, at 19:06, Markus Jelsma wrote: Hi, Either your regex is wrong, you haven't updated the CrawlDB with the new filters and/or you disabled filtering in the Generator. Cheers -Original message- From:Ian Piper ianpi...@tellura.co.uk Sent: Mon 30-Jul-2012 20:01 To: user@nutch.apache.org Subject: Why won't my crawl ignore these urls? Hi all, I have been trying to get to the bottom of this problem for ages and cannot resolve it - you're my last hope, Obi-Wan... I have a job that crawls over a client's site. I want to exclude urls that look like this: http://[clientsite.net]/resources/type.aspx?type=[whatever] http://[clientsite.net]/resources/type.aspx?type=[whatever] and http://[clientsite.net]/resources/topic.aspx?topic=[whatever] http://[clientsite.net]/resources/topic.aspx?topic=[whatever] To achieve this I thought I could put this into conf/regex-urlfilter.txt: [...] -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.* -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.* [...] Yet when I next run the crawl I see things like this: fetching http://[clientsite.net]/resources/topic.aspx?topic=10 http://[clientsite.net]/resources/topic.aspx?topic=10 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37 [...] fetching http://[clientsite.net]/resources/type.aspx?type=2 http://[clientsite.net]/resources/type.aspx?type=2 -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36 [...] and the corresponding pages seem to appear in the final Solr index. So clearly they are not being excluded. Is anyone able to explain what I have missed? Any guidance much appreciated. Thanks, Ian. -- Dr Ian Piper Tellura Information Services - the web, document and information people Registered in England and Wales: 5076715, VAT Number: 874 2060 29 http://www.tellura.co.uk/ http://www.tellura.co.uk/ Creator of monickr: http://monickr.com http://monickr.com/ 01926 813736 | 07973 156616 -- -- Dr Ian Piper Tellura Information Services - the web, document and information people Registered in England and Wales: 5076715, VAT Number: 874 2060 29 http://www.tellura.co.uk/ Creator of monickr: http://monickr.com 01926 813736 | 07973 156616 --
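A quick way to run that test is the URL filter checker that ships with Nutch: it reads urls from stdin and prints + or - depending on whether the active filters accept them. A sketch (the exact options and any short command alias vary between releases):

echo "http://clientsite.net/resources/topic.aspx?topic=10" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined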
Re: Different batch id
Hi, Most likely you run generate command a few times and did not run updatedb. So, each generate command assigned different batchId s to its own set of urls. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Tue, Jul 31, 2012 10:26 am Subject: Re: Different batch id Is there a specific place it's located? I turned on debugging, but I'm not seeing a batch id. On Mon, Jul 30, 2012 at 1:14 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Can you stick on debug logging and see what the batch ID's actually are? On Mon, Jul 30, 2012 at 6:12 PM, Bai Shen baishen.li...@gmail.com wrote: I set up Nutch 2.x with a new instance of HBase. I ran the following commands. bin/nutch inject urls bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -all When looking at the parse log, I'm seeing a bunch of different batch id messages. These are all on urls that I did not inject into the database. Any ideas what's causing this? Thanks. -- Lewis
updatedb fails to put UPDATEDB_MARK in nutch-2.0
Hello, I noticed that the updatedb command is supposed to remove the gen, parse and fetch marks and put the UPDATEDB_MARK, according to this code in DbUpdateReducer.java:

Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page);
if (mark != null) {
  Mark.UPDATEDB_MARK.putMark(page, mark);
}

However, printing the markers in Hbase shows that updatedb removes all marks except the injector one and does not put UPDATEDB_MARK. Thanks. Alex.
Re: Nutch 2 solrindex
This is directly related to the thread I have opened yesterday. I think this is a bug, since updatedb fails to put update mark. I have fixed it by modifying code. I have a patch, but not sure if I can send it as an attachment. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Wed, Aug 1, 2012 10:37 am Subject: Nutch 2 solrindex I'm trying to crawl using Nutch 2. However, I can't seem to get it to index to solr without adding -reindex to the command. And at that point it indexes everything I've crawled. I've tried both -all and the batch id, but neither one results in anything being indexed to solr. Any suggestions of what to look at? Thanks.
Re: Nutch 2 solrindex
The current code putting updb_mrk in dbUpdateReducer is as follows Utf8 mark = Mark.PARSE_MARK.removeMarkIfExist(page); if (mark != null) { Mark.UPDATEDB_MARK.putMark(page, mark); } the mark is always null, independent if there is PARSE_MARK or not. This function calls public Utf8 removeFromMarkers(Utf8 key) { if (markers == null) { return null; } getStateManager().setDirty(this, 20); return markers.remove(key); } it seems to me that getStateManager().setDirty(this, 20); removes marker and that is why the last line returns null. I tried to follow getStateManager().setDirty(this, 20) in the hierarchy of classes, but did not find anything useful. I have fixed the issue by replacing the above lines with Utf8 parse_mark = Mark.PARSE_MARK.checkMark(page); if (parse_mark != null) { Mark.UPDATEDB_MARK.putMark(page, parse_mark); Mark.PARSE_MARK.removeMark(page); } Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Thu, Aug 2, 2012 12:16 am Subject: Re: Nutch 2 solrindex Hi, Do you want to open a Jira and attach the patch over there? Or just explain what the problem is caused. I'm curious to what this might be. Thanks, Ferdy. On Wed, Aug 1, 2012 at 9:27 PM, alx...@aim.com wrote: This is directly related to the thread I have opened yesterday. I think this is a bug, since updatedb fails to put update mark. I have fixed it by modifying code. I have a patch, but not sure if I can send it as an attachment. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Wed, Aug 1, 2012 10:37 am Subject: Nutch 2 solrindex I'm trying to crawl using Nutch 2. However, I can't seem to get it to index to solr without adding -reindex to the command. And at that point it indexes everything I've crawled. I've tried both -all and the batch id, but neither one results in anything being indexed to solr. Any suggestions of what to look at? Thanks.
Re: Different batch id
Hi, I have found out that what happens after bin/nutch generate -topN 1000 is that only 1000 of the urls get marked with gnmrk. Then bin/nutch fetch -all skips all urls that do not have gnmrk, according to this code:

Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if (!NutchJob.shouldProcess(mark, batchId)) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key) + "; different batch id (" + mark + ")");
  }
  return;
}

since shouldProcess(mark, batchId) returns false if mark is null. Then bin/nutch parse -all skips all urls that do not have the fetch mark, according to this code:

Utf8 mark = Mark.FETCH_MARK.checkMark(page);
String unreverseKey = TableUtil.unreverseUrl(key);
if (!NutchJob.shouldProcess(mark, batchId)) {
  LOG.info("Skipping " + unreverseKey + "; different batch id");
  return;
}

This logs at INFO level, and those are the messages you see in the log file. So it seems to me that the -all option to fetch, parse and solrindex does not work as expected. Alex. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Thu, Aug 2, 2012 5:59 am Subject: Re: Different batch id I just tried running this with the actual batch Id instead of using -all, and I'm still getting similar results. On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen baishen.li...@gmail.com wrote: I set up Nutch 2.x with a new instance of HBase. I ran the following commands. bin/nutch inject urls bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -all When looking at the parse log, I'm seeing a bunch of different batch id messages. These are all on urls that I did not inject into the database. Any ideas what's causing this? Thanks.
Re: Nutch 2 encoding
Hi, I use hbase-0.92.1 and do not have problem with utf-8 chars. What is exactly your problem? Alex. -Original Message- From: Ake Tangkananond iam...@gmail.com To: user user@nutch.apache.org Sent: Thu, Aug 9, 2012 11:12 am Subject: Re: Nutch 2 encoding Hi, I'm debugging. I inserted a code to print out the encoding here in HtmlParser:java function getParse and it printed utf-8. So I think it might be the data store problem. What else could be the cause? Could you advise what next I should go for to have my Thai chars stored correctly in HBase? Can I simply go with the latest version of HBase? (Not sure if it is compatible with nutch 2.0) byte[] contentInOctets = page.getContent().array(); InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets)); EncodingDetector detector = new EncodingDetector(conf); detector.autoDetectClues(page, true); detector.addClue(sniffCharacterEncoding(contentInOctets), sniffed); String encoding = detector.guessEncoding(page, defaultCharEncoding); metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding); metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding); LOG.info(encoding : + encoding); input.setEncoding(encoding); Regards, Ake Tangkananond On 8/9/12 11:06 PM, Ake Tangkananond iam...@gmail.com wrote: Hi, Sorry for late reply. I was trying to figure out myself but seem no luck. I'm on Hbase with local deploy version 0.90.6, r1295128, the working version as said in Wiki: http://wiki.apache.org/nutch/Nutch2Tutorial Regards, Ake Tangkananond On 8/9/12 10:30 PM, Ferdy Galema ferdy.gal...@kalooga.com wrote: It depends on the datastore and possibly the server? What store are you using? On Thu, Aug 9, 2012 at 4:05 PM, Ake Tangkananond iam...@gmail.com wrote: Hi all, I just wonder if Nutch 2 is working fine with non english characters in your deployment? Thai language used to work fine for me in Nutch 1.5 but not in Nutch 2. Did I miss something. Anything I should check. Sorry for silly questions, but thank you in advance. ;-) Regards, Ake Tangkananond
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hello, I am getting the same error and here is the log 2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178) at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:243) at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:161) at org.apache.nutch.protocol.http.Http.getResponse(Http.java:68) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142) at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:521) Thanks. Alex. -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp334p4000616.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
I was able to do jstack just before the program exited. The output is attached. -Original Message- From: alxsss alx...@aim.com To: user user@nutch.apache.org Sent: Sat, Aug 11, 2012 2:17 pm Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Hello, I am getting the same error and here is the log 2012-08-11 13:33:08,223 ERROR http.Http - Failed with the following error: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178) at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:243) at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:161) at org.apache.nutch.protocol.http.Http.getResponse(Http.java:68) at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:142) at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:521) Thanks. Alex. -- View this message in context: http://lucene.472066.n3.nabble.com/java-lang-OutOfMemoryError-GC-overhead-limit-exceeded-tp334p4000616.html Sent from the Nutch - User mailing list archive at Nabble.com.
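Two knobs usually matter for this particular stack trace, since HttpResponse buffers the whole page body in memory: capping how much content a single fetch may download, and giving the local-mode JVM more heap. A sketch for nutch-site.xml (the value is an example):

<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description>Maximum number of bytes downloaded per page; anything beyond
  this is truncated rather than buffered in full.</description>
</property>

Exporting NUTCH_HEAPSIZE (in megabytes) before running bin/nutch raises the heap for local-mode jobs.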
updatedb error in nutch-2.0
Hello, I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) I see this is because of reversing and unreversing urls. What is the idea behind this reversal and unreversal in nutch-2.0? Thanks. Alex.
Re: updatedb error in nutch-2.0
I found out that the key sent to unreverseUrl in DbUpdateMapper.map was :index.php/http This happened in the depth 3 and I checked seed file there was no line in the form of http:/index.php Thanks. Alex. -Original Message- From: Ferdy Galema ferdy.gal...@kalooga.com To: user user@nutch.apache.org Sent: Mon, Aug 13, 2012 1:53 am Subject: Re: updatedb error in nutch-2.0 Hi, In the specific case of Alex, it means that a row name in the database is malformed. Looking at the stacktrace lines in TableUtil, it looks like an url is stored without protocol (at least without a :). This is probably because of redirected urls not correctly being checked for wellformedness. If you look at line 664 in the FetcherReducer (HEAD) it writes out a new url directly as a row in the database. I have never experienced this exception and this might be because I changed some behaviour that makes sure a redirected url is handled a bit more like a general outlink. I have created an issue for this that I will update shortly: https://issues.apache.org/jira/browse/NUTCH-1448 Ferdy. On Mon, Aug 13, 2012 at 2:52 AM, j.sulli...@thomsonreuters.com wrote: The url is stored in a different order (reversed domain name:protocol:port and path) from the order normally seen in your web browser so that it can be searched more quickly in NoSQL data stores like hbase. Nutch has a brief explanation and convenience utility methods around this at TableUtil (http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.htm l) -Original Message- From: alx...@aim.com [mailto:alx...@aim.com] Sent: Monday, August 13, 2012 9:25 AM To: user@nutch.apache.org Subject: updatedb error in nutch-2.0 Hello, I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54) at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) I see this is because of reversing and unreversing urls. What is the idea behind this reversal and unreversal in nutch-2.0? Thanks. Alex.
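For reference, the row-key scheme described in the quoted reply can be seen directly from TableUtil; a well-formed url round-trips like this (a sketch; reverseUrl throws MalformedURLException for bad input), and a key without a proper host/protocol part, like the :index.php/http above, cannot be split back, hence the ArrayIndexOutOfBoundsException:

import org.apache.nutch.util.TableUtil;

String key = TableUtil.reverseUrl("http://www.example.com/index.php");
// key is now "com.example.www:http/index.php"
String url = TableUtil.unreverseUrl(key);
// url is "http://www.example.com/index.php" again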
Re: nutch 2.0 with hbase 0.94.0
did you delete the old hbase jar from the lib dir? Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Mon, Aug 13, 2012 10:16 am Subject: Re: nutch 2.0 with hbase 0.94.0 Nutch contains no knowledge of which specific version of a backend you are using. This is however done through the gora-* dependencies managed by Ivy. Although this is a pretty convoluted way to do things, the best way to find this would be to check out Gora trunk [0], upgrade the hbase dependencies to whatever you need, compile and package the project then copy the relevant jar's over to your Nutch installation. This way you could run a standalone (development) hbase server and try running your Nutch configuration that way... hth Lewis [0] http://svn.apache.org/repos/asf/gora/trunk/ On Mon, Aug 13, 2012 at 6:11 PM, Ryan L. Sun lishe...@gmail.com wrote: hi all, I'm trying to set up nutch 2.0 with a existing hbase cluster (using hbase 0.94.0). Since nutch 2.0 supports an older version (0.90.4) of hbase, starting a nutch inject job crashed hbase daemon. Copying hbase 0.94.0's lib to nutch/runtime/local/lib folder as google search hinted doesn't work for me. Any suggestions are appreciated. Thanks. PS. I couldn't downgrade the existing hbase cluster software version, which is out of my hand. -- Lewis
updatedb goes over all urls in nutch-2.0
Hi, I noticed that the updatedb command goes over all urls, even those that were already updated in the previous generate, fetch, updatedb stages. As a result, updatedb takes a long time, depending on the number of rows in the datastore. I thought this might be redundant, and that it should be restricted to the not-yet-updated urls only. Thanks. Alex.
fetcher fails on connection error in nutch-2.0 with hbase
After fetching for about 18 hours fetcher throws this error java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150) at $Proxy6.getClosestRowBefore(Unknown Source) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:947) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:814) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:788) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:818) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409) at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:943) at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:820) at org.apache.hadoop.hbase.client.HTable.put(HTable.java:795) WARN zookeeper.ClientCnxn - Session 0x1393cf29d5e0003 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) 2012-08-19 13:26:56,935 WARN zookeeper.ClientCnxn - Session 0x1393cf29d5e0003 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) 2012-08-19 13:26:57,075 WARN zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase repeadetely and then fails. I checked the hbase process was alive. Any ideas what could cause this issue? Thanks. Alex.
speed of fetcher in nutch-2.0
Hello, I am using nutch-2.0 with hbase-0.92.1. I noticed that at depths 1, 2 and 3 the fetcher was fetching around 20K urls per hour. At depth 4 it fetches only 8K urls per hour. Any ideas what could cause this decrease in speed? I use local mode with 10 threads. Thanks. Alex.
Re: recrawl a URL?
This will work only for urls that have If-Modified-Since headers, but most urls do not have this header. Thanks. Alex. -Original Message- From: Max Dzyuba max.dzy...@comintelli.com To: Markus Jelsma markus.jel...@openindex.io; user user@nutch.apache.org Sent: Fri, Aug 24, 2012 9:02 am Subject: RE: recrawl a URL? Thanks again! I'll have to test it more then in my 1.5.1. Best regards, Max Markus Jelsma markus.jel...@openindex.io wrote: Hmm, I had to look it up but it is supported in 1.5 and 1.5.1: http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup -Original message- From:Max Dzyuba max.dzy...@comintelli.com Sent: Fri 24-Aug-2012 17:35 To: Markus Jelsma markus.jel...@openindex.io; user@nutch.apache.org Subject: RE: recrawl a URL? Thank you for the reply. Does it mean that it is not supported in the latest stable release of Nutch? -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: den 24 augusti 2012 17:21 To: user@nutch.apache.org; Max Dzyuba Subject: RE: recrawl a URL? Hi, Trunk has a feature for this: indexer.skip.notmodified Cheers -Original message- From:Max Dzyuba max.dzy...@comintelli.com Sent: Fri 24-Aug-2012 17:19 To: user@nutch.apache.org Subject: recrawl a URL? Hello everyone, I run a crawl command every day, but I don't want Nutch to submit an update to Solr if a particular page hasn't changed. How do I achieve that? Right now the value of db.fetch.interval.default doesn't seem to help prevent the crawl since the updates are submitted to Solr as if the page has been changed. I know for sure that the page has not been changed. This happens for every new crawl command. Thanks in advance, Max
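To check whether a given server actually answers conditional requests, a quick standalone test (plain Java, not part of Nutch; the 24-hour timestamp is an arbitrary choice) could look like this:

  import java.net.HttpURLConnection;
  import java.net.URL;

  public class IfModifiedCheck {
    public static void main(String[] args) throws Exception {
      HttpURLConnection con = (HttpURLConnection) new URL(args[0]).openConnection();
      // ask for the page only if it changed in the last 24 hours
      con.setIfModifiedSince(System.currentTimeMillis() - 24L * 3600 * 1000);
      con.connect();
      // 304 means the server honors conditional GETs; 200 means it always
      // re-sends the full page, so change detection has to rely on signatures
      System.out.println("HTTP " + con.getResponseCode());
      con.disconnect();
    }
  }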
Re: Nutch 2 solrindex fails with no error
You can use the -reindex option, since the update markers are not set properly in the 2.0 release. -Original Message- From: Bai Shen baishen.li...@gmail.com To: user user@nutch.apache.org Sent: Mon, Sep 17, 2012 10:16 am Subject: Re: Nutch 2 solrindex fails with no error The problem appears to be that Nutch is not sending anything to solr. But I can't seem to find a reason in nutch as to why this is. On Sat, Sep 15, 2012 at 7:36 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Solr logs? On Fri, Sep 14, 2012 at 9:33 PM, Bai Shen baishen.li...@gmail.com wrote: I have a nutch 2 setup that I got working with solr about a month ago. I had to shelve it for a little while and I've recently come back to it. Everything seems to be working fine except for the solr indexing. To my knowledge, nothing has changed between then and now, but whenever I go to perform a solrindex, nothing gets indexed. My hbase, hadoop, and solr logs are all devoid of errors. The only thing I get in the command line is the following. SolrIndexerJob: starting SolrIndexerJob: done. Any suggestions of where to look to begin troubleshooting this would be appreciated. I'm baffled. Thanks. -- Lewis
updatedb in nutch-2.0 increases fetch time of all pages
Hello, updatedb in nutch-2.0 increases the fetch time of all pages, regardless of whether they have already been fetched or not. For example, if updatedb is applied at depth 1 and page A is fetched and its fetchTime is 30 days from now, then as a result of running updatedb at depth 2 the fetch time of page A will be 60 days from now, and so on. Also, I wondered if it is possible to remove pages that do not pass the filters from the hbase datastore by using updatedb? Thanks. Alex.
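A simplified sketch of the arithmetic being reported (hypothetical numbers, not the actual Nutch fetch-schedule code): if every updatedb pass re-applies the fetch interval to rows that were not actually refetched, the scheduled fetch time keeps drifting out by one interval per cycle.

  public class FetchTimeDrift {
    public static void main(String[] args) {
      long dayMs = 24L * 3600 * 1000;
      long interval = 30 * dayMs;              // e.g. db.fetch.interval.default = 30 days
      long curTime = System.currentTimeMillis();
      long fetchTime = curTime + interval;     // scheduled after the depth-1 fetch
      // if updatedb at depth 2 treats the already-fetched row as fetched again:
      fetchTime += interval;                   // now ~60 days out, as observed
      System.out.println((fetchTime - curTime) / dayMs + " days from now");
    }
  }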
Re: Building Nutch 2.0
It seems to me that if you run nutch in deploy mode and make changes to config files then you need to rebuild .job file again unless you specify config_dir option in hadoop command. Alex. -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Mon, Oct 1, 2012 1:22 pm Subject: Re: Building Nutch 2.0 I have my 1.3 set up in a /proj/nutch/ directory that has the bin, conf, lib, logs, ..etc.., with NUTCH_HOME pointing there. I don't quite see what the difference would be for 2.x as long as NUTCH_HOME pointed to the right place. Is there documentation anywhere on how to do a deployment? -- Chris On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote: OK, I added the port being used by hbase to iptables, and now I'm farther. I'm getting: 12/10/01 19:44:17 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property. But I do have an entry there, and it matches the first in the robots.agents as well. This can only mean that you have not recompiled this stuff into the runtime/local directory. How should I have this laid out? Should I be running out of the 'runtime' dir, or is it fine that I've pulled all those files out and into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib, ..etc.. in there, with NUTCH_HOME pointing to that dir). OK so you are running locally. I can't say whether its OK to copy the directories and their content elsewhere as I've never done it however I would avoid unless absolutely necessary. It terms of the directory layout Nutch 2.x is identical to 1.x. It really helps if you make explicit which back end you intend to use as the config may alter accordingly.
nutch-2.0 generate in deploy mode
Hello, I use nutch-2.0 with hadoop-0.20.2. The bin/nutch generate command takes 87% of the cpu in deploy mode versus 18% in local mode. Any ideas how to fix this issue? Thanks. Alex.
Re: Building Nutch 2.0
According to the code in bin/nutch, if you have a .job file in your NUTCH_HOME then it means that you run it in deploy mode. If there is no .job file then you run it in local mode, so you do not need to rebuild nutch each time you change conf files. Alex. -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Tue, Oct 2, 2012 5:31 am Subject: Re: Building Nutch 2.0 Well I'm not using the deploy directory (and I can't get the hadoop to work, so the .job file shouldn't matter). I just don't see how changing the configurations (like the agent name string) would warrant rebuilding the project. I can understand it if you're switching between the storage mechanism (MySQL db vs HBase) because it is only including what is necessary (though it would be better to just have it all there in some cases), but for just a simple change I don't quite get it. Lewis -- if every time I change something minor like http.agent.name in a config file, will I have to rebuild? -- Chris On Mon, Oct 1, 2012 at 4:49 PM, alx...@aim.com wrote: It seems to me that if you run nutch in deploy mode and make changes to config files then you need to rebuild the .job file again unless you specify the config dir option in the hadoop command. Alex. -Original Message- From: Christopher Gross cogr...@gmail.com To: user user@nutch.apache.org Sent: Mon, Oct 1, 2012 1:22 pm Subject: Re: Building Nutch 2.0 I have my 1.3 set up in a /proj/nutch/ directory that has the bin, conf, lib, logs, ..etc.., with NUTCH_HOME pointing there. I don't quite see what the difference would be for 2.x as long as NUTCH_HOME pointed to the right place. Is there documentation anywhere on how to do a deployment? -- Chris On Mon, Oct 1, 2012 at 3:59 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Chris, On Mon, Oct 1, 2012 at 8:52 PM, Christopher Gross cogr...@gmail.com wrote: OK, I added the port being used by hbase to iptables, and now I'm farther. I'm getting: 12/10/01 19:44:17 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property. But I do have an entry there, and it matches the first in the robots.agents as well. This can only mean that you have not recompiled this stuff into the runtime/local directory. How should I have this laid out? Should I be running out of the 'runtime' dir, or is it fine that I've pulled all those files out and into a /proj/nutch-2.1/ directory (so there's a bin, conf, lib, ..etc.. in there, with NUTCH_HOME pointing to that dir). OK so you are running locally. I can't say whether it's OK to copy the directories and their content elsewhere as I've never done it, however I would avoid it unless absolutely necessary. In terms of the directory layout Nutch 2.x is identical to 1.x. It really helps if you make explicit which back end you intend to use as the config may alter accordingly.
Re: Error parsing html
Can you provide a few lines of the log or the url that gives the exception? -Original Message- From: CarinaBambina carina.rei...@yahoo.de To: user user@nutch.apache.org Sent: Tue, Oct 2, 2012 2:04 pm Subject: Re: Error parsing html Thanks for the reply. I'm now using Nutch 1.5.1, but nothing has changed so far. While debugging I came across the runParser method in the ParseUtil class, in which task.get(MAX_PARSE_TIME, TimeUnit.SECONDS) returns null. Therefore the ParseResult object is also null, which makes the program raise the ParseException. Right now I have no clue what the problem could be. I also tried using all default configurations, but nothing changed. -- View this message in context: http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4011495.html Sent from the Nutch - User mailing list archive at Nabble.com.
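The pattern being described, reduced to a standalone sketch (not the actual ParseUtil code; the time limit and the dummy callable are made up): a parse that does not finish within the configured limit ends up with a null result, which the caller then reports as a ParseException.

  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  public class TimedParseSketch {
    public static void main(String[] args) throws Exception {
      ExecutorService pool = Executors.newSingleThreadExecutor();
      Future<String> task = pool.submit(new Callable<String>() {
        public String call() throws Exception {
          Thread.sleep(5000);                  // stand-in for a parser that hangs
          return "parsed text";
        }
      });
      String result;
      try {
        result = task.get(1, TimeUnit.SECONDS); // plays the role of MAX_PARSE_TIME
      } catch (TimeoutException e) {
        result = null;                          // what the debugger observes upstream
      }
      System.out.println(result);
      pool.shutdownNow();
    }
  }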
Re: Error parsing html
I checked the urls you provided with parsechecker and they are parsed correctly. You can check yourself by doing bin/nutch parsechecker yoururl. In your implementation, can you check if the segment dir has the correct permissions? Alex. -Original Message- From: CarinaBambina carina.rei...@yahoo.de To: user user@nutch.apache.org Sent: Tue, Oct 9, 2012 10:03 am Subject: Re: Error parsing html I now also tried using the source files themselves instead of the nutch.jar, but nothing changed. Is there anyone who has an idea what the reason for this error might be? Or at least where and what I should look for? Any hint?! Thanks in advance! -- View this message in context: http://lucene.472066.n3.nabble.com/Error-parsing-html-tp3994699p4012755.html Sent from the Nutch - User mailing list archive at Nabble.com.
nutch-2.0-fetcher fails in reduce stage
Hello, I try to use nutch-2.0, hadoop-1.03, hbase-0.92.1 in pseudo distributed mode with iptables turned off. As soon as map reaches 100%, fetcher works for a few minutes and fails with the error java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328) at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362) at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150) at $Proxy10.getClosestRowBefore(Unknown Source) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:947) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:814) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:788) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:818) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1524) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409) at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:943) at org.apache.gora.hbase.store.HBaseTableConnection.close(HBaseTableConnection.java:96) at org.apache.gora.hbase.store.HBaseStore.close(HBaseStore.java:599) at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) org.apache.gora.util.GoraException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to master/192.168.1.4:60020 after attempts=1 at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167) at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118) at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88) at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.lt;initgt;(ReduceTask.java:569) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:638) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.Child.main(Child.java:249) Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to master/192.168.1.4:60020 after attempts=1 at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:242) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1278) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1235) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1222) at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:918) at
Re: nutch-2.0-fetcher fails in reduce stage
Hello, Today, I closely followed all hbase and hadoop logs. As soon as map reached 100% reduce was 33%. Then when reduce reached 66% I saw in hadoop's datanode log the following error 2012-10-16 22:44:54,634 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-179532189-192.168.1.4-50010-1349640973409, infoPort=50075, ipcPort=50020):DataXceiver java.io.EOFException: while trying to read 65557 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107) at java.lang.Thread.run(Thread.java:662) And hbase's regionserver stopped without any errors. I do not see any errors in hbase master and hadoop namenode logs. @Lewis Not sure what do you mean about configuration to run behind proxy. I closely followed hbase configuration at http://hbase.apache.org/book/configuration.html box1 --is a local fedora linux box with dynamic ip box2 --is a dedicated fedora server with static ip. In box 2 fetcher runs without any errors, but the generated set is 100,000 times less than the set in box1 Thanks in advance. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Tue, Oct 16, 2012 2:40 am Subject: Re: nutch-2.0-fetcher fails in reduce stage Hi Alex, I've seen similar exceptions numerous times [0] when running the Gora test suite against HBase however this _always_ occurred against an HBase version other than the officially supported version of HBase (which is 0.90.4) when behind a local proxy so I am immediately tempted to speculate that this may be the source of the problem. On Tue, Oct 16, 2012 at 3:50 AM, alx...@aim.com wrote: at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) org.apache.gora.util.GoraException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to master/192.168.1.4:60020 after attempts=1 The above two slices of the stack would also indicate that this is the case. bin/nutch inject works fine. Also, I have a different linux, box. fetcher with the same config runs fine, but the generated set is much less than in the first linux box. I don't really understand this very well it is quite ambiguous. Can you clearly define between box1 and box2... and which one works and which one doesn't? Also how are your HBase configurations across these boxes and how are you running Nutch? Any ideas how to fix this issue and what is the benefit running fetcher in pseudo distributed mode against the local one? Finally, is your Nutch deployment configured to run behind a proxy? I know there is no mention of this but maybe there is more to this than simply disabling iptables! I am not however HBase literate enough to comment further on what configuration causes this, therefore I've copied in the user@ gora list as well. 
@user@ The original thread for this topic can be found below [1] [0] http://www.mail-archive.com/dev@gora.apache.org/msg00485.html [1] http://www.mail-archive.com/user@nutch.apache.org/msg07823.html hth Lewis
Re: Same pages crawled more than once and slow crawling
Hello, I think the problem is with the storage not nutch itself. Looks like generate cannot read status or fetch time (or gets null values) from mysql. I had a bunch of issues with mysql storage and switched to hbase at the end. Alex. -Original Message- From: Sebastian Nagel wastl.na...@googlemail.com To: user user@nutch.apache.org Sent: Thu, Oct 18, 2012 12:08 pm Subject: Re: Same pages crawled more than once and slow crawling Hi Luca, I'm using Nutch 2.1 on Linux and I'm having similar problem of http://goo.gl/nrDLV, my Nutch is fetching same pages at each round. Um... I failed to reproduce the Pierre's problem with - a simpler configuration - HBase as back-end (Pierre and Luca both use mysql) Then I ran bin/nutch crawl urls -threads 1 first.htm was fetched 5 times second.htm was fetched 4 times third.htm was fetched 3 times But after the 5th cycle the crawler stopped? I tried doing each step separately (inject, generate, ...) with the same results. For Pierre this has worked... Any suggestions? Also the whole process take about 2 minutes, am I missing something about some delay config or is this normal? Well, Nutch (resp. Hadoop) are designed to process much data. Job management has some overhead (and some artificial sleeps): 5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs. 6s per job seems roughly ok, though it could be slightly faster. Sebastian On 10/18/2012 05:55 PM, Luca Vasarelli wrote: Hello, I'm using Nutch 2.1 on Linux and I'm having similar problem of http://goo.gl/nrDLV, my Nutch is fetching same pages at each round. I've built a simple localhost site, with 3 pages linked each other: first.htm - second.htm - third.htm I did these steps: - downloaded nutch 2.1 (source) untarred to ${TEMP_NUTCH} - edited ${TEMP_NUTCH}/ivy/ivy.xml uncommenting the line about mysql backend (thanks to [1]) - edited ${TEMP_NUTCH}/conf/gora.properties removing default sql configuration and adding mysql properties (thanks to [1]) - ran ant runtime from ${TEMP_NUTCH} - moved ${TEMP_NUTCH}/runtime/local/ to /opt/${NUTCH_HOME} - edited ${NUTCH_HOME}/conf/nutch-site.xml adding http.agent.name, http.robots.agents and changing db.ignore.external.links to true and fetcher.server.delay to 0.0 - created ${NUTCH_HOME}/urls/seed.txt with http://localhost/test/first.htm; inside this file - created db table as [1] Then I ran bin/nutch crawl urls -threads 1 first.htm was fetched 5 times second.htm was fetched 4 times third.htm was fetched 3 times I tried doing each step separately (inject, generate, ...) with the same results. Also the whole process take about 2 minutes, am I missing something about some delay config or is this normal? Some extra info: - HTML of the pages: http://pastebin.com/dyDPJeZs - Hadoop log: http://pastebin.com/rwQQPnkE - nutch-site.xml: http://pastebin.com/0WArkvh5 - Wireshark log: http://pastebin.com/g4Bg17Ls - MySQL table: http://pastebin.com/gD2SvGsy [1] http://nlp.solutions.asia/?p=180
Re: Same pages crawled more than once and slow crawling
Hello, I meant that it could be a gora-mysql problem. In order to test it, you can run nutch in local mode with GeneratorJob DEBUG logging enabled. Put the line log4j.logger.org.apache.nutch.crawl.GeneratorJob=DEBUG,cmdstdout in your conf/log4j.properties and run the crawl cycle with updatedb. If gora-mysql works properly, then you should see lines like "shouldFetch rejected '<url>', fetchTime=<fetchTime>, curTime=<curTime>" in the output for those urls that were fetched in the previous cycle. If you do not see them, then it means gora-mysql has issues. Good luck. Alex. -Original Message- From: Luca Vasarelli luca.vasare...@iit.cnr.it To: user user@nutch.apache.org Sent: Fri, Oct 19, 2012 1:01 am Subject: Re: Same pages crawled more than once and slow crawling Hi Luca, Hi Sebastian, thanks for replying! But after the 5th cycle the crawler stopped? Yes For Pierre this has worked... Any suggestions? I can post info for each step, but please tell me which log is more important: Hadoop log? MySQL table? If this last one, which fields? Alex says it's a MySQL problem, how can I verify after the generate step if he is correct? Well, Nutch (resp. Hadoop) are designed to process much data. Job management has some overhead (and some artificial sleeps): 5 cycles * 4 jobs (generate/fetch/parse/update) = 20 jobs. 6s per job seems roughly ok, though it could be slightly faster. Yes, this test is not well designed for Nutch, but I thought, as Stefan said, about a config or hardcoded delay somewhere in the nutch files I can try to reduce, since I will use it on a single machine. Luca
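For what that DEBUG line is checking, a hypothetical minimal sketch (not the actual GeneratorJob code): a row whose stored fetchTime is still in the future should be rejected by generate, so refetching the same pages every cycle suggests the fetchTime is not being persisted or read back correctly from the backend.

  public class ShouldFetchSketch {
    // reject rows whose next scheduled fetch is still in the future
    static boolean shouldFetch(long fetchTime, long curTime) {
      return fetchTime <= curTime;
    }
    public static void main(String[] args) {
      long now = System.currentTimeMillis();
      long thirtyDays = 30L * 24 * 3600 * 1000;
      System.out.println(shouldFetch(now + thirtyDays, now)); // false -> "shouldFetch rejected"
      System.out.println(shouldFetch(now - 1000, now));       // true  -> generated again
    }
  }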
Re: Image search engine based on nutch/solr
Hello, I have also written this kind of plugin. But instead of putting thumbnail files in solr index they are put in a folder. Only, filenames are kept in the solr index. I wondered what is the advantage of putting thumbnail files in the solr index? Thanks in advance. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Sun, Oct 21, 2012 7:26 pm Subject: Re: Image search engine based on nutch/solr Hi, As Lewis say before, if you are going to use nutch for image retrieval and indexing in solr, you'll need to invest some time writing some tools depending on your needs. I've been working on a search engine using nutch for the crawling process and solr as an indexing server, the typical use, when we start dealing with images we became aware that nutch (through the tike project) extract to few information about the image per se (basically only metadata, gets extracted), I think that this is the biggest problem with nutch. One particular requirement for me was to show a thumbnail of the image, so I wrote a plugin that generates the thumbnail, then encode it using base64 and store it in the solr index. Other need was to annotate the image with the surrounding text to improve the search, I also write a plugin for this. Summarizing, nutch it's a very good start point, but depending on your particular needs you'll have to write some plugins on your own. Greetings On Oct 20, 2012, at 10:02 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, On Fri, Oct 19, 2012 at 10:48 PM, Santosh Mahto santosh.inb...@gmail.com wrote: Hi all I have few question: 1. Does nutch support images crawling and indexing(or how much support is there) Depending on how you wish to process and then present your images e.g. as thumbnails for example, I would say you need to invest some time writing a custom parser for images. You can read a pretty thorough and comprehensive thread [0] on this topic. 2. As I got some link where apache-tika plugin is used to make image search engine, with little exploration i found tikka is defaulted in nutch(as I think ,not sure) . so is image seaching also happens by default. Image processing and indexing is not enabled my default in the above context 3. As I think i also need to configure solr to show the image result . could you guide me what extra configuration need to be set in solr side Unless someone here who has worked with image indexing in Solr can help you in a more verbose manner than me, I would certainly direct you to thee solr-user@ list archives [1]. There appears to be plenty there. hth Lewis [0] http://www.mail-archive.com/user@nutch.apache.org/msg06758.html [1] http://www.mail-archive.com/search?q=imagel=solr-user%40lucene.apache.org 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
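For illustration, a minimal standalone sketch of the thumbnail-plus-base64 step both approaches share (the 100x100 size and jpg format are arbitrary choices, not taken from either plugin); Alex's variant would write out.toByteArray() to a file and index only the filename, while Jorge's stores the base64 string in a Solr field.

  import java.awt.Graphics2D;
  import java.awt.image.BufferedImage;
  import java.io.ByteArrayOutputStream;
  import java.io.File;
  import javax.imageio.ImageIO;
  import javax.xml.bind.DatatypeConverter;

  public class ThumbnailSketch {
    public static void main(String[] args) throws Exception {
      BufferedImage src = ImageIO.read(new File(args[0]));
      BufferedImage thumb = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
      Graphics2D g = thumb.createGraphics();
      g.drawImage(src, 0, 0, 100, 100, null);   // scale down to a 100x100 thumbnail
      g.dispose();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      ImageIO.write(thumb, "jpg", out);
      // base64 form suitable for storing in an index field
      String base64 = DatatypeConverter.printBase64Binary(out.toByteArray());
      System.out.println(base64.substring(0, Math.min(60, base64.length())) + "...");
    }
  }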
Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails
Hi, I think in order to be sure that this is gora-sql problem, you need to do the same crawling with nutch/hbase. It must not take much time if you run it in local mode. Simply install hbase and follow quick start tutorial. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Thu, Nov 1, 2012 9:29 am Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails Hi, I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487). Do you think this is because of the SQL backend ? Its failing for PDF files but working for HTML files. Can the problem be due to some bug in the tika.parser code (since tika plugin handles the PDF parsing) ? I am interesting in fixing this problem, if i can find out where the issue starts. Does anyone have inputs for this ? Thanks, Kiran. On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Yes please do open an issue. The docs should be parsed in one go and I suspect (yet another) issue with the SQL backend Thanks J On 1 November 2012 13:48, kiran chitturi chitturikira...@gmail.com wrote: Thank you alxsss for the suggestion. It displays the actualSize and inHeaderSize for every file and two more lines in logs but it did not much information even when i set parserJob to Debug. I had the same problem when i re-compiled everything today. I have to run the parse command multiple times to get all the files parsed. I am using SQL with GORA. Its mysql database. For now, atleast the files are getting parsed, do i need to open a issue for this ? Thank you, Regards, Kiran. On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Kiran Interesting. Which backend are you using with GORA? The SQL one? Could be a problem at that level Julien On 31 October 2012 17:01, kiran chitturi chitturikira...@gmail.com wrote: Hi Julien, I have just noticed something when running the parse. First when i ran the parse command 'sh bin/nutch parse 1351188762-1772522488', the parsing of all the PDF files has failed. When i ran the command again one pdf file got parsed. Next time, another pdf file got parsed. When i ran the parse command the number of times the total number of pdf files, all the pdf files got parsed. In my case, i ran it 17 times and all the pdf files are parsed. Before that, not everything is parsed. This sounds strange, do you think it is some configuration problem ? I have tried this 2 times and same thing happened two times for me . I am not sure why this is happening. Thanks for your help. Regards, Kiran. On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi Sorry about that. I did not notice the parsecodes are actually nutch and not tika. no problems! The setup is local on Mac desktop and i am using through command line and remote debugging through eclipse ( http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse ). OK I have set both http.content.limit and file.content.limit to -1. The logs just say 'WARN parse.ParseUtil - Unable to successfully parse content http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of type application/pdf'. you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml right? (not in $NUTCH_HOME/conf/nutch-site.xml unless you call 'ant clean runtime') All the html's are getting parsed and when i crawl this page ( http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the html's and some of the pdf files get parsed. 
Like, half of the pdf files get parsed and the other half don't get parsed. do the ones that are not parsed have something in common? length? I am not sure about what causing the problem as you said parsechecker is actually work. I want the parser to crawl the full-text of the pdf and the metadata, title. OK The metatags are also getting crawled for failed pdf parsing. They would be discarded because of the failure even if they were successfully extracted indeed. The current mechanism does not cater for semi-failures J. -- * *Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Kiran Chitturi -- * *Open Source Solutions for Text Engineering
Re: Access crawled content or parsed data of previous crawled url
It is not clear what you try to achieve. We have done something similar in regard of indexing img tags. We retrieve img tag data while parsing the html page and keep it in a metadata and when parsing img url itself we create thumbnail. hth. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 2:58 pm Subject: Re: Access crawled content or parsed data of previous crawled url Any documentation about crawldb api? I'm guessing the it shouldn't be so hard to retrieve a documento by it's url (which is basically what I need. I'm also open to any suggestion on this matter, so If any one has done something similar or has any thoughts on this and can share it, I'll be very grateful. Greetings! - Mensaje original - De: Stefan Scheffler sscheff...@avantgarde-labs.de Para: user@nutch.apache.org Enviados: Miércoles, 28 de Noviembre 2012 15:04:44 Asunto: Re: Access crawled content or parsed data of previous crawled url Hi, I think, this is possible, because you can write a ParserPlugin which access the allready stored documents via the segments- /crawldb api. But i´m not sure how it will work exactly. Regards Stefan Re Am 28.11.2012 20:59, schrieb Jorge Luis Betancourt Gonzalez: Hi: For what I've seen in nutch plugins exist the philosophy of one NutchDocument per url, but I was wondering if there is any way of accessing parsed/crawled content of a previous fetched/parsed url, let's say for instance that I've a HTML page with an image embedded: So the start point will be http://host.com/test.html which is the first document that get's fetched/parsed then the OutLink extractor will detect the embedded image inside test.html and then add the url in the src attribute of the img tag, so then the image url will be fetched and then parsed. My question: Is possible, when the image is getting parsed, to access the content and parsed data of test.html? I'm trying to add some data present on the HTML page as a new metadata field of the image, and I'm not quite sure on how to accomplish this. Greetings in advance! 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Stefan Scheffler Avantgarde Labs GbR Löbauer Straße 19, 01099 Dresden Telefon: + 49 (0) 351 21590834 Email: sscheff...@avantgarde-labs.de 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci 10mo. ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci
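Independent of the Nutch plugin API, the HTML-side half of that idea can be sketched as a plain DOM walk that collects per-image text keyed by the image url (the alt/title attributes here are illustrative; the actual plugin may use surrounding text instead):

  import java.util.HashMap;
  import java.util.Map;
  import org.w3c.dom.Element;
  import org.w3c.dom.NodeList;

  public class ImgTextCollector {
    // collect src -> alt/title text for every img element in an already-parsed DOM;
    // a parse filter could stash this map in the page metadata so that the
    // later fetch/parse of the image url itself can pick the text up again
    static Map<String, String> collectImgText(Element root) {
      Map<String, String> imgText = new HashMap<String, String>();
      NodeList imgs = root.getElementsByTagName("img");
      for (int i = 0; i < imgs.getLength(); i++) {
        Element img = (Element) imgs.item(i);
        String text = (img.getAttribute("alt") + " " + img.getAttribute("title")).trim();
        imgText.put(img.getAttribute("src"), text);
      }
      return imgText;
    }
  }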
Re: Access crawled content or parsed data of previous crawled url
Hi, Unfortunately, my employer does not want me to disclose details of the plugin at this time. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 6:20 pm Subject: Re: Access crawled content or parsed data of previous crawled url Hi Alex: What you've done is basically what I'm try to accomplish: I'm trying to get the text surrounding the img tags to improve the image search engine we're building (this is done when the html page containing the img tag is parsed), and when the image url itself is parsed we generate thumbnails and extract some metadata. But how do you keep the this 2 pieces of data linked together inside your index (solr in my case). Because the thing is that I'm getting two documents inside solr (1. containing the text surrounding the img tag, and other document with the thumbnail). So what brings me troubles is how when the thumbnail is being generated can I get the surrounding text detecte when the html was parsed? Thanks a lot for all the replies! P.S: Alex, can you share some piece of code (if it's possible) of your working plugins? Or walk me through what you've came up with? - Mensaje original - De: alx...@aim.com Para: user@nutch.apache.org Enviados: Miércoles, 28 de Noviembre 2012 19:54:07 Asunto: Re: Access crawled content or parsed data of previous crawled url It is not clear what you try to achieve. We have done something similar in regard of indexing img tags. We retrieve img tag data while parsing the html page and keep it in a metadata and when parsing img url itself we create thumbnail. hth. Alex. -Original Message- From: Jorge Luis Betancourt Gonzalez jlbetanco...@uci.cu To: user user@nutch.apache.org Sent: Wed, Nov 28, 2012 2:58 pm Subject: Re: Access crawled content or parsed data of previous crawled url Any documentation about crawldb api? I'm guessing the it shouldn't be so hard to retrieve a documento by it's url (which is basically what I need. I'm also open to any suggestion on this matter, so If any one has done something similar or has any thoughts on this and can share it, I'll be very grateful. Greetings! - Mensaje original - De: Stefan Scheffler sscheff...@avantgarde-labs.de Para: user@nutch.apache.org Enviados: Miércoles, 28 de Noviembre 2012 15:04:44 Asunto: Re: Access crawled content or parsed data of previous crawled url Hi, I think, this is possible, because you can write a ParserPlugin which access the allready stored documents via the segments- /crawldb api. But i´m not sure how it will work exactly. Regards Stefan Re Am 28.11.2012 20:59, schrieb Jorge Luis Betancourt Gonzalez: Hi: For what I've seen in nutch plugins exist the philosophy of one NutchDocument per url, but I was wondering if there is any way of accessing parsed/crawled content of a previous fetched/parsed url, let's say for instance that I've a HTML page with an image embedded: So the start point will be http://host.com/test.html which is the first document that get's fetched/parsed then the OutLink extractor will detect the embedded image inside test.html and then add the url in the src attribute of the img tag, so then the image url will be fetched and then parsed. My question: Is possible, when the image is getting parsed, to access the content and parsed data of test.html? I'm trying to add some data present on the HTML page as a new metadata field of the image, and I'm not quite sure on how to accomplish this. Greetings in advance! 10mo. 
ANIVERSARIO DE LA CREACION DE LA UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS... CONECTADOS AL FUTURO, CONECTADOS A LA REVOLUCION http://www.uci.cu http://www.facebook.com/universidad.uci http://www.flickr.com/photos/universidad_uci -- Stefan Scheffler Avantgarde Labs GbR Löbauer Straße 19, 01099 Dresden Telefon: + 49 (0) 351 21590834 Email: sscheff...@avantgarde-labs.de
Re: Native Hadoop library not loaded and Cannot parse sites contents
move or copy that jar file to local/lib and try again. hth. Alex. -Original Message- From: Arcondo arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 2:55 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hope that now you can see them Plugin folder http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png Parse Job http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png Parse error : Hadoop.log http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png My nutch-site.xm (plugin includes) property nameplugin.includes/name valueprotocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic/value descriptionRegular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. /description /property -- View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Native Hadoop library not loaded and Cannot parse sites contents
Which version of nutch is this? Did you follow the tutorial? I can help yuu if you provide all steps you did, starting with downloading nutch. Alex. -Original Message- From: Arcondo Dasilva arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 1:23 pm Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hi Alex, I tried. That was the first thing I did but without success. I don't understand why I'm obliged to use Neko instead of Tika. As far as I know tika can parse more than 1200 different formats Kr, Arcondo On Fri, Jan 4, 2013 at 7:47 PM, alx...@aim.com wrote: move or copy that jar file to local/lib and try again. hth. Alex. -Original Message- From: Arcondo arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 2:55 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hope that now you can see them Plugin folder http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png Parse Job http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png Parse error : Hadoop.log http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png My nutch-site.xm (plugin includes) property nameplugin.includes/name valueprotocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic/value descriptionRegular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. /description /property -- View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Native Hadoop library not loaded and Cannot parse sites contents
Hi, You can unjar the jar file, check if the class that parse complains about is inside it. You can also try to put content of jar file under local /lib. Maybe there is some read restriction. If this does not help, I can only suggest to start again with a new copy of nutch. Alex. -Original Message- From: Arcondo Dasilva arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Sat, Jan 5, 2013 1:11 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hi Alex, I'm using 2.1 version / hbase 0.90.6 / solr 4.0 everything works fine except I'm not able to parse the contents of my url because of the error Nekohtml not found. my plugins include looks like this : valueprotocol-http|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|lib-nekohtml/value I added lib-nekohtml at the end of the allowed values but seems that has no effect on the error. in my runtime/local/plugins/lib-nekohtml, I have the jar file : nekohtml-0.9.5.jar is there something I should look for beside this ? Thanks a lot for your help. Kr, Arcondo On Fri, Jan 4, 2013 at 11:33 PM, alx...@aim.com wrote: Which version of nutch is this? Did you follow the tutorial? I can help yuu if you provide all steps you did, starting with downloading nutch. Alex. -Original Message- From: Arcondo Dasilva arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 1:23 pm Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hi Alex, I tried. That was the first thing I did but without success. I don't understand why I'm obliged to use Neko instead of Tika. As far as I know tika can parse more than 1200 different formats Kr, Arcondo On Fri, Jan 4, 2013 at 7:47 PM, alx...@aim.com wrote: move or copy that jar file to local/lib and try again. hth. Alex. -Original Message- From: Arcondo arcondo.dasi...@gmail.com To: user user@nutch.apache.org Sent: Fri, Jan 4, 2013 2:55 am Subject: Re: Native Hadoop library not loaded and Cannot parse sites contents Hope that now you can see them Plugin folder http://lucene.472066.n3.nabble.com/file/n4030524/plugin_folder.png Parse Job http://lucene.472066.n3.nabble.com/file/n4030524/parse_job.png Parse error : Hadoop.log http://lucene.472066.n3.nabble.com/file/n4030524/parse_error.png My nutch-site.xm (plugin includes) property nameplugin.includes/name valueprotocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic/value descriptionRegular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library. /description /property -- View this message in context: http://lucene.472066.n3.nabble.com/Native-Hadoop-library-not-loaded-and-Cannot-parse-sites-contents-tp4029542p4030524.html Sent from the Nutch - User mailing list archive at Nabble.com.
nutch/util/NodeWalker class is not thread safe
Hello, I use the NodeWalker class at src/java/org/apache/nutch/util/NodeWalker.java in one of our plugins. I noticed this comment above the class definition: "Currently this class is not thread safe. It is assumed that only one thread will be accessing the NodeWalker at any given time." Any ideas if this can cause problems, and how to make it thread safe? Thanks. Alex.
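The usual workaround, sketched under the assumption that the walker is only used to traverse a parsed DOM (the method itself is illustrative, not from the plugin): since NodeWalker keeps traversal state internally, give each call (or each thread) its own instance instead of sharing one across parser threads.

  import org.apache.nutch.util.NodeWalker;
  import org.w3c.dom.Node;

  public class WalkerPerCall {
    // one walker per invocation, so no traversal state is shared between threads
    static String extractText(Node docRoot) {
      NodeWalker walker = new NodeWalker(docRoot);
      StringBuilder sb = new StringBuilder();
      while (walker.hasNext()) {
        Node n = walker.nextNode();
        if (n.getNodeType() == Node.TEXT_NODE) {
          sb.append(n.getNodeValue()).append(' ');
        }
      }
      return sb.toString().trim();
    }
  }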
Re: Nutch 2.0 updatedb and gora query
I see that inlinks are saved as ol in hbase. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jan 30, 2013 9:31 am Subject: Re: Nutch 2.0 updatedb and gora query Link to the reference ( http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html) and jira (https://issues.apache.org/jira/browse/NUTCH-1524) On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi chitturikira...@gmail.comwrote: Hi, I have posted a similar issue in dev list [0]. The problem comes with inlinks not being saved to database even though they are added to the webpage object. I am curious about what happens after the fields are saved in the webpage object. How are they sent to Gora ? Which class is used to communicate with Gora ? I have seen Storage Utils class but i want to know if its the only class that is used to communicate with databases. Please let me know your suggestions. I feel, the inlinks are not being saved due to small problem in the code. [0] - http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser Thanks, -- Kiran Chitturi -- Kiran Chitturi
Re: Nutch 2.0 updatedb and gora query
What do you call inlinks? I call inlinks of mysite.com all urls like mysite.com/myhtml1.html, mysite.com/myhtml2.html, etc. Currently they are saved as ol in hbase. From the hbase shell do get 'webpage', 'com.mysite:http/' and check what the ol family looks like. I have these config properties: <property> <name>db.ignore.external.links</name> <value>true</value> </property> <property> <name>db.ignore.internal.links</name> <value>true</value> </property> Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jan 30, 2013 11:11 am Subject: Re: Nutch 2.0 updatedb and gora query I have checked the database after the dbupdate job is run and I could see only markers, signature and fetch fields. The initial seed which was crawled and parsed has only outlinks. I notice one of the outlinks is actually the inlink. Aren't inlinks supposed to be saved during the dbUpdatedJob? When I tried to debug, I could see in eclipse and in the dbUpdateReducer job that the inlinks are being saved to the page object along with fetch fields and markers, but I did not understand where the data is going from there. Is the data written to Hbase during the dbUpdateReducer job? Thanks, Kiran. On Wed, Jan 30, 2013 at 1:43 PM, alx...@aim.com wrote: I see that inlinks are saved as ol in hbase. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Wed, Jan 30, 2013 9:31 am Subject: Re: Nutch 2.0 updatedb and gora query Link to the reference ( http://lucene.472066.n3.nabble.com/Inlinks-not-being-saved-in-the-database-td4037067.html ) and jira (https://issues.apache.org/jira/browse/NUTCH-1524) On Wed, Jan 30, 2013 at 12:25 PM, kiran chitturi chitturikira...@gmail.com wrote: Hi, I have posted a similar issue in the dev list [0]. The problem comes with inlinks not being saved to the database even though they are added to the webpage object. I am curious about what happens after the fields are saved in the webpage object. How are they sent to Gora? Which class is used to communicate with Gora? I have seen the StorageUtils class but i want to know if it's the only class that is used to communicate with databases. Please let me know your suggestions. I feel the inlinks are not being saved due to a small problem in the code. [0] - http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/browser Thanks, -- Kiran Chitturi -- Kiran Chitturi -- Kiran Chitturi
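The same check from Java rather than the hbase shell, as a minimal sketch (the 'webpage' table and the 'ol'/'il' families follow the default gora-hbase mapping; the row key is the hypothetical reversed url from the shell example above):

  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class DumpLinkFamilies {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "webpage");
      Result row = table.get(new Get(Bytes.toBytes("com.mysite:http/")));
      // "ol" holds outlinks (target url -> anchor text); inlinks go to "il"
      for (String family : new String[] { "ol", "il" }) {
        Map<byte[], byte[]> cells = row.getFamilyMap(Bytes.toBytes(family));
        System.out.println(family + ": " + (cells == null ? 0 : cells.size()) + " entries");
      }
      table.close();
    }
  }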
Re: Nutch 1.6 +solr 4.1.0
Hi, Not sure about solrdedup, but solrindex worked for me in nutch-1.4 with solr-4.1.0. Alex. -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 6, 2013 6:13 pm Subject: Re: Nutch 1.6 +solr 4.1.0 Hi, We are not good to go with Solr 4.1 yet. There are changes required to schema.xml as well as the indexer package in nutch to accommodate api changes in 4.1. Please check our Jira for these issues. I am happy to help with the update however it will block some other proposed changes to the pluggable indexers... On Wednesday, February 6, 2013, Mustafa Elkhiat melkh...@gmail.com wrote: i crawl website by this command bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 10 -topN 10 but i faced this exception how to fix it error? SolrIndexer: starting at 2013-02-07 03:02:07 SolrIndexer: deleting gone documents: false SolrIndexer: URL filtering: false SolrIndexer: URL normalizing: false org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused SolrDeleteDuplicates: starting at 2013-02-07 03:02:29 SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ Exception in thread main java.io.IOException: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981) at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353) at org.apache.nutch.crawl.Crawl.run(Crawl.java:153) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:198) ... 
16 more Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:422) ... 20 more -- *Lewis*
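The stack trace itself points at the immediate cause: the SolrJ client cannot open a connection to http://localhost:8983/solr/ at all (Connection refused), which is separate from the Solr 4.1 schema/API incompatibilities Lewis mentions. A quick standalone connectivity check with the same client class that appears in the trace (adjust the class name if your SolrJ version differs):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class SolrPing {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr/");
      // throws SolrServerException (wrapping ConnectException) if nothing is
      // actually listening at that URL
      System.out.println("ping status: " + solr.ping().getStatus());
    }
  }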
Re: Nutch 2.1 + HBase cluster settings
Hi, So, you do not run Hadoop, and the Nutch job still works in distributed mode? Thanks. Alex. -Original Message- From: k4200 k4...@kazu.tv To: user user@nutch.apache.org Sent: Wed, Feb 6, 2013 7:43 pm Subject: Re: Nutch 2.1 + HBase cluster settings Hi Lewis, There seems to be a bug in the HBase 0.90.4 library, which comes with Nutch. I replaced hbase-0.90.4.jar with hbase-0.90.6-cdh3u5.jar and the problem was resolved. Regards, Kaz 2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Please let us know how you get on, as we can add this to the 2.x errors section of the wiki. Thanks and good luck with the problem. Lewis On Wed, Feb 6, 2013 at 4:45 PM, k4200 k4...@kazu.tv wrote: Hi Lewis, Thanks for your reply. 2013/2/7 Lewis John Mcgibbney lewis.mcgibb...@gmail.com: Hi, On Wednesday, February 6, 2013, k4200 k4...@kazu.tv wrote: Q1. My first question is how to fix this issue? Do I need any other settings for Nutch to utilize an HBase cluster correctly? In short, I would personally shoot this over to the HBase lists. As you mention, the ZK connections have been increased but you are still experiencing similar results. Did you mention which HBase dist you are using? Sorry, I should have mentioned this in the previous email. I use CDH3 Update 5 on CentOS 6.3, so HBase 0.90.6 with some patches. I'll ask the HBase list as well. Q2. The second question is about Nutch and Hadoop. I didn't install the Hadoop Job Tracker and Task Tracker because HBase itself doesn't need them according to a SO question [2], but does Nutch need them for some types of jobs? No, running Hadoop in pseudo or distributed mode is not a prerequisite for running Nutch successfully, but it can be extremely helpful, not least because you get the web app for job navigation and control. In the instance that Nutch is being run without the Hadoop JT and TT (e.g. local mode), it simply relies upon the Hadoop library pulled via Ivy. Thanks for the clarification. I'll run JT and TT. I looked for some documents or diagrams that describe the overall architecture of Nutch with Gora and HBase, but couldn't find a good one. Mmm. What exactly are you looking for here? We have various articles here [0] which explain quite a bit to get you started. Inevitably there is no better substitute than looking into the code, and unfortunately we don't have any diagrams as such. One resource which may be of interest (regarding the Gora API and relevant layers) can be found in last year's GSoC project reports [1]. There are some Gora architecture class diagrams available there, however I warn that (latterly) they introduce the Gora Web Services API, which was written into the current 0.3 development code. Thanks for the pointers. And, you're right, I'll look into the code, too. Thanks, Kaz hth somewhat though. Lewis [0] http://wiki.apache.org/nutch/#Nutch_2.x [1] http://svn.apache.org/repos/asf/gora/committers/reporting/ -- *Lewis*
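For reference, the pieces usually involved in pointing Nutch 2.x at a remote HBase cluster look roughly like the sketch below; the property names are the standard Gora/HBase ones, while the host names and the connection limit value are placeholders to adapt. In conf/gora.properties, select the HBase-backed Gora store:

    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

In conf/hbase-site.xml on the Nutch side, point the HBase client at the cluster's ZooKeeper ensemble:

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
    </property>

The per-client ZooKeeper connection cap discussed above is controlled on the cluster side by hbase.zookeeper.property.maxClientCnxns in hbase-site.xml (e.g. raise it from the default if crawl jobs exhaust it).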
Re: Nutch identifier while indexing.
Are you saying that your sites have the form siteA.mydomain.com, siteB.mydomain.com, siteC.mydomain.com? Alex. -Original Message- From: mbehlok m_beh...@hotmail.com To: user user@nutch.apache.org Sent: Wed, Feb 13, 2013 11:05 am Subject: Nutch identifier while indexing. Hello, I am indexing 3 sites: SiteA SiteB SiteC I want to index these sites in a way that, when searching them in Solr, I can query each of these sites separately. So one could say... that's easy, just filter them by host... WRONG... The sites are hosted on the same host but have different starting points. That is, starting the crawl from different root urls (SiteA, SiteB, SiteC) produces different results. My idea is to somehow specify an identifier in schema.xml that passes to Solr which root url produced that crawl. Any ideas on how to implement this? Any variations? Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-identifier-while-indexing-tp4040285.html Sent from the Nutch - User mailing list archive at Nabble.com.
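One possible approach, not discussed further in this thread, is to tag each seed with metadata and index that tag: Nutch 1.x ships a urlmeta plugin for this, though the property and field names below should be verified against the version in use, and the urls are placeholders. Each root url in the seed file carries a tab-separated key=value pair:

    http://www.example.com/siteA/	site=siteA
    http://www.example.com/siteB/	site=siteB
    http://www.example.com/siteC/	site=siteC

In nutch-site.xml, add urlmeta to plugin.includes and list the tag to propagate and index:

    <property>
      <name>urlmeta.tags</name>
      <value>site</value>
    </property>

Then add a site field to Solr's schema.xml (and to solrindex-mapping.xml if your setup maps fields explicitly), and each search can be restricted with a filter query such as fq=site:siteA.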
nutch cannot retrieve title and inlinks of a domain
Hello, I noticed that nutch cannot retrieve the title and inlinks of one of the domains in the seed list. However, if I run identical code from the server where this domain is hosted, then it parses it correctly. The surprising thing is that in both cases this url has status: 2 (status_fetched) and parseStatus: success/ok (1/0), args=[]. I used nutch-2.1 with hbase-0.92.1, as well as nutch-1.4. Any ideas why this happens? Thanks. Alex.
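A debugging note, not from the original thread: the stored record can be inspected directly to see which fields actually made it into the store. The command and row-key layout below are the Nutch 2.x defaults and the url is a placeholder, so adjust them to your setup:

    # dump what Nutch 2.x stored for a single url
    bin/nutch readdb -url http://www.example.com/

    # or look at the raw HBase row; Nutch 2.x keys rows by the reversed url
    hbase shell
    get 'webpage', 'com.example.www:http/'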
Re: nutch cannot retrieve title and inlinks of a domain
Hi, I noticed that for the other urls in the seed, inlinks are saved under ol. I checked the code and figured out that this is done by the part that saves anchors. So, in my case inlinks are saved as anchors in the ol field in hbase. But for one of the urls, the title and inlinks are not retrieved, although its parse status is marked success/ok (1/0), args=[]. Alex. -Original Message- From: kiran chitturi chitturikira...@gmail.com To: user user@nutch.apache.org Sent: Wed, Feb 13, 2013 12:40 pm Subject: Re: nutch cannot retrieve title and inlinks of a domain Hi Alex, Inlinks do not work for me either at the moment, for the same domain [0]. I am using Nutch-2.x and HBase. Do the inlinks get saved for you for some of the crawl seeds? Surprisingly, the title does not get saved. Did you try using parsechecker? [0] - http://www.mail-archive.com/user@nutch.apache.org/msg08627.html On Wed, Feb 13, 2013 at 3:26 PM, alx...@aim.com wrote: Hello, I noticed that nutch cannot retrieve the title and inlinks of one of the domains in the seed list. However, if I run identical code from the server where this domain is hosted, then it parses it correctly. The surprising thing is that in both cases this url has status: 2 (status_fetched) and parseStatus: success/ok (1/0), args=[]. I used nutch-2.1 with hbase-0.92.1, as well as nutch-1.4. Any ideas why this happens? Thanks. Alex. -- Kiran Chitturi
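As a follow-up to the parsechecker suggestion above, running it against the problem url shows what the parser extracts without touching the crawl store; the url below is a placeholder:

    bin/nutch parsechecker -dumpText http://www.example.com/

It should print the parse status, title and outlinks (plus the extracted text when -dumpText is given), which helps narrow down whether the missing title and inlinks are lost during fetching/parsing or while being written to HBase.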
fields in solrindex-mapping.xml
Hello, I see that there are <field dest="segment" source="segment"/>, <field dest="boost" source="boost"/>, <field dest="digest" source="digest"/> and <field dest="tstamp" source="tstamp"/> entries, in addition to the title, host and content ones, in nutch-2.x's solrindex-mapping.xml. I thought tstamp may be needed for sorting documents. What about the other fields: segment, boost and digest? Can someone explain why these fields are included in solrindex-mapping.xml? Thanks. Alex.