Re: Text search on a PDF file using hadoop

GaneshG Wed, 30 Jul 2008 04:26:25 -0700

Thanks Joman, I tried PDFBox. It converts PDFs to text files, and the
Hadoop search works fine on those files. Performance is not good, though,
since we first have to check whether each file is a PDF and then convert
it. It also generates txt files with the same base name as the original
PDF, so if we already have index.txt and we convert index.pdf, that will
be a problem for searches. We need to find some other way...
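
The two problems above (checking the file type, and index.pdf colliding
with index.txt) can be sketched in plain Java. This is only a minimal
sketch under stated assumptions, not something PDFBox or Hadoop provides:
the class PdfSniff and the methods isPdfHeader and textNameFor are
hypothetical names, the check assumes the "%PDF-" header sits at the very
start of the file, and the naming scheme (append ".txt" to the full PDF
name) is just one possible convention.

```java
import java.nio.charset.StandardCharsets;

public class PdfSniff {
    // PDF files start with the ASCII header "%PDF-" followed by a version.
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns true if the given leading bytes of a file begin with "%PDF-".
     * The caller reads the first few bytes from whatever stream it has.
     * Assumption: the header is at offset 0 (no leading junk bytes).
     */
    public static boolean isPdfHeader(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (firstBytes[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Hypothetical naming convention: keep the ".pdf" part and append
     * ".txt", so the text for index.pdf never lands on index.txt.
     */
    public static String textNameFor(String pdfName) {
        return pdfName + ".txt";
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] textStart = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isPdfHeader(pdfStart));
        System.out.println(isPdfHeader(textStart));
        System.out.println(textNameFor("index.pdf"));
    }
}
```

Only the first five bytes of each file need to be read for the check, so
it is cheap compared with the conversion itself, and it does not depend
on the file extension being right.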




Joman Chu-2 wrote:
> 
> I've been investigating this recently, and I came across Apache PDFBox
> (http://incubator.apache.org/projects/pdfbox.html), which may
> accomplish this in native Java. Try it out and get back to us on how
> well it works, I'd be curious to know.
> 
> Joman Chu
> AIM: ARcanUSNUMquam
> IRC: irc.liquid-silver.net
> 
> 
> On Wed, Jul 23, 2008 at 9:39 AM, Dhruba Borthakur <[EMAIL PROTECTED]>
> wrote:
>> One option for you is to use a pdf-to-text converter (many of them are
>> available online) and then run map-reduce on the txt file.
>>
>> -dhruba
>>
>> On Wed, Jul 23, 2008 at 1:07 AM, GaneshG
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Thanks Lohit, I am using only the default reader and I am very new to
>>> hadoop.
>>> This is my map method
>>>
>>>      public void map(LongWritable key, Text value,
>>>          OutputCollector<Text, Text> output, Reporter reporter)
>>>          throws IOException {
>>>        String line = value.toString();
>>>        StringTokenizer tokenizer = new StringTokenizer(line);
>>>        while (tokenizer.hasMoreTokens()) {
>>>          String val = tokenizer.nextToken();
>>>          try {
>>>            if (val != null && val.contains("the")) {
>>>              word.set(line);
>>>              FileSplit spl = (FileSplit) reporter.getInputSplit();
>>>              output.collect(word, new Text(spl.getPath().getName()));
>>>            }
>>>          } catch (Exception e) {
>>>            System.out.println(e);
>>>          }
>>>        }
>>>      }
>>>    }
>>>
>>> I have a pdf file in my dfs input folder. Can you tell me what I have to
>>> do to read pdf files?
>>>
>>> Thanks
>>> Ganesh.G
>>>
>>>
>>> lohit-2 wrote:
>>>>
>>>> Can you provide more information? How are you passing your input? Are
>>>> you passing raw pdf files? If so, are you using your own record reader?
>>>> The default record reader won't read pdf files, and you won't get the
>>>> text out of them as is.
>>>> Thanks,
>>>> Lohit
>>>>
>>>>
>>>>
>>>> ----- Original Message ----
>>>> From: GaneshG <[EMAIL PROTECTED]>
>>>> To: [email protected]
>>>> Sent: Wednesday, July 23, 2008 1:51:52 AM
>>>> Subject: Text search on a PDF file using hadoop
>>>>
>>>>
>>>> When I search for text in a pdf file using hadoop, the results do not
>>>> come out properly. I tried to debug my program, and I could see that the
>>>> lines read from the pdf file are not formatted. Please help me to
>>>> resolve this.
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/Text-search-on-a-PDF-file-using-hadoop-tp18606475p18606475.html
>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>>
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Re%3A-Text-search-on-a-PDF-file-using-hadoop-tp18606558p18606703.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Re%3A-Text-search-on-a-PDF-file-using-hadoop-tp18606558p18731134.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
