Hi,

I think that in order to be sure this is a gora-sql problem, you need to do
the same crawl with nutch/hbase. It should not take much time if you run it
in local mode. Simply install hbase and follow the quick start tutorial.
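
Roughly, the steps would be something like this (a sketch, assuming Nutch 2.x
with the gora-hbase dependency enabled in ivy/ivy.xml; adjust versions and
paths to your setup):

  # conf/nutch-site.xml: point Gora at HBase instead of SQL
  #   <property>
  #     <name>storage.data.store.class</name>
  #     <value>org.apache.gora.hbase.store.HBaseStore</value>
  #   </property>

  # start a local HBase as described in the quick start tutorial
  tar xzf hbase-x.y.z.tar.gz && cd hbase-x.y.z
  bin/start-hbase.sh

  # then repeat the same crawl steps against HBase
  bin/nutch inject urls
  bin/nutch generate
  bin/nutch fetch <batchId>
  bin/nutch parse <batchId>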

Alex.

-----Original Message-----
From: kiran chitturi <chitturikira...@gmail.com>
To: user <user@nutch.apache.org>
Sent: Thu, Nov 1, 2012 9:29 am
Subject: Re: bin/nutch parsechecker -dumpText works but bin/nutch parse fails


Hi,

I have created an issue (https://issues.apache.org/jira/browse/NUTCH-1487).

Do you think this is because of the SQL backend? It's failing for PDF files
but working for HTML files.

Can the problem be due to a bug in the tika parser code (since the tika
plugin handles the PDF parsing)?

I am interested in fixing this problem, if I can find out where the issue
starts.
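
For reference, parsechecker succeeds on one of the failing documents and
dumps its text, while the parse step fails on the same document:

  bin/nutch parsechecker -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf
  bin/nutch parse 1351188762-1772522488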

Does anyone have any input on this?

Thanks,
Kiran.



On Thu, Nov 1, 2012 at 10:15 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi
>
> Yes, please do open an issue. The docs should be parsed in one go, and I
> suspect (yet another) issue with the SQL backend.
>
> Thanks
>
> J
>
> On 1 November 2012 13:48, kiran chitturi <chitturikira...@gmail.com>
> wrote:
>
> > Thank you alxsss for the suggestion. It displays the actualSize and
> > inHeaderSize for every file and two more lines in the logs, but it did
> > not give much information even when I set ParserJob to DEBUG.
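> >
> > (For reference, the line I added to conf/log4j.properties was something
> > like the following; the logger name is my guess at the right class:
> >
> >   log4j.logger.org.apache.nutch.parse.ParserJob=DEBUG,cmdstdout
> > )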
> >
> > I had the same problem when I re-compiled everything today. I have to run
> > the parse command multiple times to get all the files parsed.
> >
> > I am using SQL with GORA. It's a MySQL database.
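> >
> > (The relevant gora.properties entries look roughly like this; values
> > anonymised:
> >
> >   gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
> >   gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch
> >   gora.sqlstore.jdbc.user=xxxxx
> >   gora.sqlstore.jdbc.password=xxxxx
> > )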
> >
> > For now, at least the files are getting parsed. Do I need to open an
> > issue for this?
> >
> > Thank you,
> >
> > Regards,
> > Kiran.
> >
> >
> > On Wed, Oct 31, 2012 at 4:36 PM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> > > Hi Kiran
> > >
> > > Interesting. Which backend are you using with GORA? The SQL one? It
> > > could be a problem at that level.
> > >
> > > Julien
> > >
> > > On 31 October 2012 17:01, kiran chitturi <chitturikira...@gmail.com>
> > > wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > I have just noticed something when running the parse.
> > > >
> > > > First, when I ran the parse command 'sh bin/nutch parse
> > > > 1351188762-1772522488', the parsing of all the PDF files failed.
> > > >
> > > > When I ran the command again, one PDF file got parsed. Next time,
> > > > another PDF file got parsed.
> > > >
> > > > When I had run the parse command as many times as there were PDF
> > > > files, all of the PDF files got parsed.
> > > >
> > > > In my case, I ran it 17 times and all the PDF files were parsed.
> > > > Before that, not everything was parsed.
> > > >
> > > > This sounds strange. Do you think it is some configuration problem?
> > > >
> > > > I have tried this twice and the same thing happened both times.
> > > >
> > > > I am not sure why this is happening.
> > > >
> > > > Thanks for your help.
> > > >
> > > > Regards,
> > > > Kiran.
> > > >
> > > >
> > > > On Wed, Oct 31, 2012 at 10:28 AM, Julien Nioche <
> > > > lists.digitalpeb...@gmail.com> wrote:
> > > >
> > > > > Hi
> > > > >
> > > > >
> > > > > > Sorry about that. I did not notice the parse codes are actually
> > > > > > nutch and not tika.
> > > > > >
> > > > > no problems!
> > > > >
> > > > >
> > > > > > The setup is local on a Mac desktop and I am using it through the
> > > > > > command line and remote debugging through Eclipse (
> > > > > > http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse
> > > > > > ).
> > > > > >
> > > > >
> > > > > OK
> > > > >
> > > > > >
> > > > > > I have set both http.content.limit and file.content.limit to -1.
> > > > > > The logs just say 'WARN  parse.ParseUtil - Unable to successfully
> > > > > > parse content
> > > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/watson.pdf of
> > > > > > type application/pdf'.
> > > > > >
> > > > >
> > > > > you set it in $NUTCH_HOME/runtime/local/conf/nutch-site.xml, right?
> > > > > (not in $NUTCH_HOME/conf/nutch-site.xml, unless you call 'ant clean
> > > > > runtime')
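> > > > >
> > > > > i.e. something like this in runtime/local/conf/nutch-site.xml (-1
> > > > > means no limit on the content size):
> > > > >
> > > > >   <property>
> > > > >     <name>http.content.limit</name>
> > > > >     <value>-1</value>
> > > > >   </property>
> > > > >   <property>
> > > > >     <name>file.content.limit</name>
> > > > >     <value>-1</value>
> > > > >   </property>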
> > > > >
> > > > >
> > > > > >
> > > > > > All the HTMLs are getting parsed, and when I crawl this page (
> > > > > > http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/), all the HTMLs
> > > > > > and some of the PDF files get parsed. Roughly half of the PDF
> > > > > > files get parsed and the other half don't.
> > > > > >
> > > > >
> > > > > Do the ones that are not parsed have something in common? Length?
> > > > >
> > > > >
> > > > > > I am not sure what is causing the problem, since, as you said,
> > > > > > parsechecker actually works. I want the parser to crawl the full
> > > > > > text of the PDF as well as the metadata and title.
> > > > > >
> > > > >
> > > > > OK
> > > > >
> > > > >
> > > > > >
> > > > > > The metatags are also getting crawled even for the PDFs whose
> > > > > > parsing failed.
> > > > > >
> > > > >
> > > > > Indeed, they would be discarded because of the failure even if they
> > > > > were successfully extracted. The current mechanism does not cater
> > > > > for semi-failures.
> > > > >
> > > > > J.
> > > > >
> > > > > --
> > > > > Open Source Solutions for Text Engineering
> > > > >
> > > > > http://digitalpebble.blogspot.com/
> > > > > http://www.digitalpebble.com
> > > > > http://twitter.com/digitalpebble
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kiran Chitturi
> > > >
> > >
> > >
> > >
> > > --
> > > Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > > http://twitter.com/digitalpebble
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>



-- 
Kiran Chitturi
