*./nutch parsechecker -D http.agent.name="tralala" -D http.content.limit=-1 -dumpText http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf*
works absolutely fine in both the trunk and 2.x branch. Try from the
runtime/local/bin directory and check the logs for more details.

On 30 October 2012 13:54, kiran chitturi <chitturikira...@gmail.com> wrote:

> Interestingly, the tika jar I have downloaded separately is able to parse
> all the text from the pdf files while the nutch tika parser is failing for
> some of the files. I have set the content.limit to -1.
>
> The error message is '2012-10-30 09:30:37,382 WARN parse.ParseUtil -
> Unable to successfully parse content
> http://scholar.lib.vt.edu/ejournals/ALAN/v29n3/pdf/yearwood.pdf of type
> application/pdf'
>
> for the failed pdf files. I could see some title and text when I was
> debugging in Eclipse but I could see it failing due to the parseCodes.
>
> Thank you.
> Kiran
>
> On Tue, Oct 30, 2012 at 8:58 AM, kiran chitturi
> <chitturikira...@gmail.com> wrote:
>
> > Hi
> >
> > I did not set the content limit to -1 but I have set it high enough to
> > be able to go through the documents that I am parsing. I could see some
> > title and text but I am not sure how much it is able to do. I am gonna
> > try using tika separately and try to process the documents. If all of it
> > goes through tika-1.2 separately then I have to try to debug where I am
> > getting the error here.
> >
> > Many Thanks,
> > Kiran.
> >
> >
> > On Tue, Oct 30, 2012 at 4:37 AM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> >> Hi
> >>
> >> Look at the code for the class ParseStatusCodes. This simply indicates
> >> that the parsing failed and is not the cause for the failing itself. Do
> >> you get the entire text for the document or just what the parser managed
> >> to process until it failed? Did you set the content limit to -1?
> >>
> >> Thanks
> >>
> >> Julien
> >>
> >>
> >> On 29 October 2012 19:17, kiran chitturi <chitturikira...@gmail.com>
> >> wrote:
> >>
> >> > Hi!
> >> >
> >> > I am debugging nutch with eclipse and I have found out that some pdf
> >> > files which are not successfully parsed have majorCode as 2 and
> >> > minorCode as 200 and files which are successfully parsed have
> >> > majorCode 1 and minorCode 0.
> >> >
> >> > Can someone please explain to me or point to what these codes mean?
> >> >
> >> > Actually, the title, text and everything is parsed in the failed
> >> > parses but somehow because of the codes it is not saving the fields
> >> > and returning as failed parsing.
> >> >
> >> > Thanks for your help.
> >> >
> >> > Regards,
> >> > --
> >> > Kiran Chitturi
> >> >
> >>
> >>
> >>
> >> --
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
> >>
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>
>
>
> --
> Kiran Chitturi
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
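
For reference, the codes asked about in the original question (majorCode 2 / minorCode 200 versus majorCode 1 / minorCode 0) can be decoded with a small lookup. This is only a sketch: the class `ParseCodeNames` and its `describe` method are hypothetical helpers, not part of Nutch, and the constant names and values are as I recall them from ParseStatusCodes, so please check them against the source in your own checkout.

```java
import java.util.Map;

// Hypothetical helper, NOT part of Nutch. The names and values below are
// assumed to mirror the constants in ParseStatusCodes; verify against your source.
class ParseCodeNames {
    // Major codes: overall outcome of the parse.
    static final Map<Integer, String> MAJOR = Map.of(
            0, "NOTPARSED",
            1, "SUCCESS",
            2, "FAILED");

    // Minor codes: only the ones relevant to this thread are listed.
    static final Map<Integer, String> MINOR = Map.of(
            0, "SUCCESS_OK",
            100, "SUCCESS_REDIRECT",
            200, "FAILED_EXCEPTION");

    // Returns a readable label for a (major, minor) pair; unknown codes are kept as numbers.
    static String describe(int major, int minor) {
        String maj = MAJOR.getOrDefault(major, "UNKNOWN(" + major + ")");
        String min = MINOR.getOrDefault(minor, "UNKNOWN(" + minor + ")");
        return maj + " / " + min;
    }

    public static void main(String[] args) {
        System.out.println(describe(2, 200)); // the failing PDFs
        System.out.println(describe(1, 0));   // the PDFs that parsed
    }
}
```

If those values are right, 2/200 only says that the parser threw an exception, which matches the point above that the code reports the failure rather than causing it; the underlying exception should be in the logs.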