Hi,

I think in order to be sure that this is a gora-sql problem, you need to do the same crawl with Nutch/HBase. It should not take much time if you run it in local mode. Simply install HBase and follow the quick start tutorial.
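For reference, the backend swap suggested above is, in Nutch 2.x, mostly a configuration change. A minimal sketch — the property name and store class below are assumptions based on the standard Gora/HBase setup, not something stated in this message:

```xml
<!-- $NUTCH_HOME/runtime/local/conf/nutch-site.xml (sketch, assumed property name) -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Use the HBase store instead of the SQL backend for Gora storage.</description>
</property>
```

You would also need the gora-hbase dependency on the classpath (in Nutch 2.x this typically means enabling it in ivy/ivy.xml and rebuilding) and a running local HBase from the quick start tutorial.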
Alex.

-----Original Message-----
From: kiran chitturi <chitturikira...@gmail.com>
To: user <user@nutch.apache.org>
Sent: Thu, Nov 1, 2012 9:29 am
Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails

Hi,

I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487). Do you think this is because of the SQL backend? It's failing for PDF files but working for HTML files. Can the problem be due to some bug in the tika.parser code (since the tika plugin handles the PDF parsing)?

I am interested in fixing this problem, if I can find out where the issue starts. Does anyone have input on this?

Thanks,
Kiran.

On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi
>
> Yes, please do open an issue. The docs should be parsed in one go and I
> suspect (yet another) issue with the SQL backend.
>
> Thanks
>
> J
>
> On 1 November 2012 13:48, kiran chitturi <chitturikira...@gmail.com> wrote:
>
> > Thank you alxsss for the suggestion. It displays the actualSize and
> > inHeaderSize for every file, plus two more lines in the logs, but it did
> > not give much information even when I set ParserJob to DEBUG.
> >
> > I had the same problem when I re-compiled everything today. I have to run
> > the parse command multiple times to get all the files parsed.
> >
> > I am using SQL with Gora. It's a MySQL database.
> >
> > For now, at least the files are getting parsed. Do I need to open an
> > issue for this?
> >
> > Thank you,
> >
> > Regards,
> > Kiran.
> >
> > On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> > > Hi Kiran
> > >
> > > Interesting. Which backend are you using with Gora? The SQL one? Could
> > > be a problem at that level.
> > >
> > > Julien
> > >
> > > On 31 October 2012 17:01, kiran chitturi <chitturikira...@gmail.com>
> > > wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > I have just noticed something when running the parse.
> > > > First, when I ran the parse command 'sh bin/nutch parse
> > > > 1351188762-1772522488', the parsing of all the PDF files failed.
> > > >
> > > > When I ran the command again, one PDF file got parsed. The next time,
> > > > another PDF file got parsed.
> > > >
> > > > Once I had run the parse command as many times as there were PDF
> > > > files, all the PDF files had been parsed.
> > > >
> > > > In my case, I ran it 17 times and all the PDF files were parsed.
> > > > Before that, not everything was parsed.
> > > >
> > > > This sounds strange. Do you think it is some configuration problem?
> > > >
> > > > I have tried this twice, and the same thing happened both times.
> > > >
> > > > I am not sure why this is happening.
> > > >
> > > > Thanks for your help.
> > > >
> > > > Regards,
> > > > Kiran.
> > > >
> > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <
> > > > lists.digitalpeb...@gmail.com> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > > Sorry about that. I did not notice the parse codes are actually
> > > > > > Nutch and not Tika.
> > > > >
> > > > > no problem!
> > > > >
> > > > > > The setup is local on a Mac desktop, and I am using it through the
> > > > > > command line and remote debugging through Eclipse (
> > > > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> > > > > > ).
> > > > >
> > > > > OK
> > > > >
> > > > > > I have set both http.content.limit and file.content.limit to -1.
> > > > > > The logs just say 'WARN parse.ParseUtil - Unable to successfully
> > > > > > parse content
> > > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of
> > > > > > type application/pdf'.
> > > > >
> > > > > You set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right?
> > > > > (Not in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean
> > > > > runtime'.)
> > > > >
> > > > > > All the HTML files are getting parsed, and when I crawl this page
> > > > > > (http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTML
> > > > > > files and some of the PDF files get parsed. Roughly half of the
> > > > > > PDF files get parsed and the other half don't.
> > > > >
> > > > > Do the ones that are not parsed have something in common? Length?
> > > > >
> > > > > > I am not sure what is causing the problem, since, as you said,
> > > > > > parsechecker actually works. I want the parser to crawl the full
> > > > > > text of the PDF as well as the metadata and title.
> > > > >
> > > > > OK
> > > > >
> > > > > > The metatags are also getting crawled for the failed PDF parses.
> > > > >
> > > > > They would indeed be discarded because of the failure, even if they
> > > > > were successfully extracted. The current mechanism does not cater
> > > > > for semi-failures.
> > > > >
> > > > > J.
> > > > >
> > > > > --
> > > > > Open Source Solutions for Text Engineering
> > > > >
> > > > > http://digitalpebble.blogspot.com/
> > > > > http://www.digitalpebble.com
> > > > > http://twitter.com/digitalpebble
> > > >
> > > > --
> > > > Kiran Chitturi
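To tie the configuration discussion together: the override Kiran describes would look like the fragment below, using the two property names given in the thread (the description text is mine, and the exact file layout assumes a Nutch 2.x local-mode install):

```xml
<!-- $NUTCH_HOME/runtime/local/conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>-1 disables truncation of content fetched over HTTP.</description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>Same, for content fetched via the file protocol.</description>
</property>
```

As Julien notes above, in local mode the copy under $NUTCH_HOME/runtime/local/conf/ is the one actually read at run time; edits to $NUTCH_HOME/conf/nutch-site.xml only take effect after rebuilding the runtime (e.g. 'ant clean runtime'), so a PDF can keep arriving truncated even though the limit looks disabled in the source tree.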