Re: Nutch 2.x parse MajorCode, MinorCode

Julien Nioche Tue, 30 Oct 2012 08:07:29 -0700

./nutch parsechecker -D http.agent.name="tralala" -D http.content.limit=-1
-dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf


This works absolutely fine in both trunk and the 2.x branch. Try running it
from the runtime/local/bin directory and check the logs for more details.
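
For reference, the majorCode/minorCode pairs asked about in the original
question can be read roughly as below. This is only a minimal sketch and not
the Nutch class itself: the class name ParseCodeSketch and the describe helper
are hypothetical, and the constant names are recalled from ParseStatusCodes,
so they should be checked against the source. Only the numeric pairs 1/0 and
2/200 come from this thread.

```java
/** Illustrative stand-in for the codes discussed in this thread; NOT the Nutch class. */
final class ParseCodeSketch {
    // Major codes. 1 and 2 are the values reported in this thread;
    // 0 and all constant names are recalled from ParseStatusCodes (assumption).
    static final int NOTPARSED = 0;
    static final int SUCCESS = 1;
    static final int FAILED = 2;

    // Minor codes. 0 and 200 are the values reported in this thread;
    // the constant names are an assumption.
    static final int SUCCESS_OK = 0;
    static final int FAILED_EXCEPTION = 200;

    /** Hypothetical helper: turns a (major, minor) pair into a readable label. */
    static String describe(int major, int minor) {
        if (major == SUCCESS && minor == SUCCESS_OK) {
            return "success/ok";
        }
        if (major == FAILED && minor == FAILED_EXCEPTION) {
            return "failed/exception";
        }
        if (major == NOTPARSED) {
            return "notparsed";
        }
        // Any other pair is outside what this sketch covers.
        return "unrecognised(" + major + "," + minor + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(1, 0));
        System.out.println(describe(2, 200));
    }
}
```

If those names are right, a 2/200 pair only says that the parser threw an
exception; the exception itself is what to look for in the logs.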

On 30 October 2012 13:54, kiran chitturi <chitturikira...@gmail.com> wrote:

> Interestingly, the Tika jar I downloaded separately is able to parse
> all the text from the PDF files, while the Nutch Tika parser fails for
> some of the files. I have set the content limit to -1.
>
> The error message is '2012-10-30 09:30:37,382 WARN  parse.ParseUtil -
> Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type
> application/pdf'
>
> for the failed PDF files. I can see some title and text when I am
> debugging in Eclipse, but I can see it failing because of the parse codes.
>
> Thank you.
> Kiran
>
> On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi
> <chitturikira...@gmail.com> wrote:
>
> > Hi
> >
> > I did not set the content limit to -1, but I have set it high enough to
> > get through the documents that I am parsing. I can see some title and
> > text, but I am not sure how much it manages to extract. I am going to
> > try using Tika separately to process the documents. If all of them go
> > through tika-1.2 separately, then I will have to debug where the error
> > is coming from here.
> >
> > Many Thanks,
> > Kiran.
> >
> >
> > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> >> Hi
> >>
> >> Look at the code for the class ParseStatusCodes. These codes simply
> >> indicate that the parsing failed; they are not the cause of the failure
> >> itself. Do you get the entire text for the document or just what the
> >> parser managed to process until it failed? Did you set the content
> >> limit to -1?
> >>
> >> Thanks
> >>
> >> Julien
> >>
> >>
> >> On 29 October 2012 19:17, kiran chitturi <chitturikira...@gmail.com>
> >> wrote:
> >>
> >> > Hi!
> >> >
> >> > I am debugging Nutch with Eclipse and I have found that some PDF
> >> > files which are not successfully parsed have majorCode 2 and
> >> > minorCode 200, while files which are successfully parsed have
> >> > majorCode 1 and minorCode 0.
> >> >
> >> > Can someone please explain to me, or point me to, what these codes
> >> > mean?
> >> >
> >> > Actually, the title, text and everything else are parsed in the
> >> > failed parses, but somehow because of the codes it is not saving the
> >> > fields and is returning a failed parse.
> >> >
> >> > Thanks for your help.
> >> >
> >> > Regards,
> >> > --
> >> > Kiran Chitturi
> >> >
> >>
> >>
> >>
> >> --
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
> >>
> >
> >
> >
> > --
> > Kiran Chitturi
> >
> >
>
>
> --
> Kiran Chitturi
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
