I first extract the contents from documents using Tika and later index them with Lucene. The problem is that the text Tika extracts from PDFs has no whitespace.
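To make the symptom concrete, here is a minimal sketch in plain Python (not Tika or Lucene code) of why run-together text gives wrong search results: anything that splits on whitespace sees the sample output from this thread as one long token. The function names `whitespace_tokens` and `looks_run_together` and the 0.05 threshold are illustrative assumptions, not part of either library.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace, roughly what a whitespace-based tokenizer does."""
    return text.split()


def looks_run_together(text, min_space_ratio=0.05):
    """Heuristic: flag text whose share of whitespace characters is suspiciously low.

    The 0.05 default is an arbitrary assumption, not a Tika or Lucene setting.
    """
    if not text:
        return False
    spaces = sum(1 for ch in text if ch.isspace())
    return spaces / len(text) < min_space_ratio


extracted = "Someofthedifferencesare"      # sample output from the PDF
expected = "Some of the differences are"   # what the text should be

print(whitespace_tokens(extracted))   # a single long token
print(whitespace_tokens(expected))    # five separate tokens
print(looks_run_together(extracted), looks_run_together(expected))
```

A check like `looks_run_together` could be run on extracted text before indexing to spot affected PDFs, though the right threshold would depend on the documents.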

Regards
Ganesh

----- Original Message -----
From: "McGibbney, Lewis John" <lewis.mcgibb...@gcu.ac.uk>
To: <java-user@lucene.apache.org>
Sent: Friday, December 03, 2010 4:40 PM
Subject: RE: PDF text extracted without spaces

> Hi Ganesh
>
> I encountered this same problem last week. I was wondering whether it would be
> possible to include at minimum a WhitespaceAnalyzer somewhere within Tika, which
> would solve the problem. I am not sure how this would be done, as I am not
> familiar with the Tika codebase.
>
> Unfortunately, I don't think the solution to the first part of this
> problem lies within the java-user mailing list.
>
> When were you sending extracted contents to Lucene... at what later stage?
>
> Thank you
>
> Lewis
>
> -----Original Message-----
> From: Ganesh [mailto:emailg...@yahoo.co.in]
> Sent: 03 December 2010 10:44
> To: java-user@lucene.apache.org
> Subject: Re: PDF text extracted without spaces
>
> The main problem is that I am not getting whitespace and newline characters.
> This happens only for PDF documents.
>
> Sample output: "Someofthedifferencesare", but it should be "Some of the
> differences are".
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Alexander Aristov" <alexander.aris...@gmail.com>
> To: <java-user@lucene.apache.org>
> Sent: Friday, December 03, 2010 2:39 PM
> Subject: Re: PDF text extracted without spaces
>
>> Anyway, even if you get correct whitespace and new lines, this won't affect
>> indexing.
>>
>> Best Regards
>> Alexander Aristov
>>
>> On 3 December 2010 10:00, Lance Norskog <goks...@gmail.com> wrote:
>>
>>> The text should come out as a stream of words with spaces, but without
>>> any of the formatting in the PDF. Extraction is only good enough to
>>> tell you that a word is somewhere inside a PDF file. Can you post a
>>> short bit of the text that it extracted?
>>>
>>> Also, you should try this test on different PDF files that were made
>>> with different software.
>>>
>>> On Thu, Dec 2, 2010 at 9:35 PM, Ganesh <emailg...@yahoo.co.in> wrote:
>>> > Hello all,
>>> >
>>> > I know this is not the right group to ask this question, but I thought
>>> > some of you might have experienced this.
>>> >
>>> > I am a newbie with Tika. I am using the latest version, 0.8. I extracted
>>> > text from a PDF document but found spaces and newlines missing. Indexing
>>> > the data gives wrong results. Could anyone in this group help me? I am
>>> > using Tika directly to extract the contents, which later get indexed.
>>> >
>>> > Regards
>>> > Ganesh
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com