Hi Mohamed,

Thank you for your question. Your code below looks like it accomplishes the basics and meets the requirements of assignment #1. BTW, I'm CC'ing [email protected].
The "(optional) check OCR quality" step refers to the fact that in Tika 1.5 we rely on PDF parsing code that doesn't always extract the text characters from PDF files correctly. The step suggests that if you can figure out a way to check the quality of the extracted text, you may want to do so. Alternatively, you may want to check out TIKA-93 [1] and Grant's work there and help me test it out.

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/TIKA-93

-----Original Message-----
From: Mohamed Mustafa Rafik Khimani <[email protected]>
Date: Saturday, February 22, 2014 5:23 PM
To: Chris Mattmann <[email protected]>
Subject: Re: CSCI ASSIGNMENT QUESTION

>Hello Professor Mattmann,
>
>Thank you for your reply.
>
>Currently, I am doing the following:
>
>InputStream is = new BufferedInputStream(new FileInputStream(f));
>Parser parser = new PDFParser();
>ContentHandler handler = new BodyContentHandler(-1);
>Metadata metadata = new Metadata();
>
>// Use the parser to parse each PDF file
>parser.parse(is, handler, metadata, new ParseContext());
>
>// Get the content of the PDF file as a string
>String content = handler.toString();
>
>// For every keyword, check whether it is present in the file and
>// update keyword_counts and num_fileswithkeywords accordingly
>for (String keyword : keywords)
>{
>    if (content.contains(keyword))
>    {
>        updatelog(keyword, f.getName());
>        int count = keyword_counts.get(keyword);
>        count++;
>        keyword_counts.put(keyword, count);
>        num_fileswithkeywords++;
>    }
>}
>
>I do not understand what "(optional) check OCR quality before proceeding"
>means. Could you guide me to where I can look for this?
>
>The above code processes all PDF files and prints the output as
>needed. Could you please let me know if anything else is needed for the
>assignment, except the optional OCR quality check?
>
>Sincerely,
>
>Mohamed Mustafa Khimani
>
>On Wed, Feb 19, 2014 at 7:33 PM, Chris Mattmann
><[email protected]> wrote:
>
>Thanks for your question Mohamed, feel free to send these
>types of questions to [email protected]. It would be a
>great place to ask them, and tell your classmates too.
>
>I'm copying the list on this message.
>
>(BTW you can then find the mail in Google and other
>mail archives after that)
>
>Sometimes the MIME type is incorrectly detected, and
>the best bet is to file a JIRA issue here in Tika:
>
>https://issues.apache.org/jira/browse/TIKA
>
>and then attach the sample PDF file for testing.
>
>If you have to preprocess a file in your specific
>assignment in CS572, that's fine too; you can just
>force it to call the PDF parser by calling it
>directly from your program or Java code,
>and then bypass that step.
>
>HTH!
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>[email protected]
>
>-----Original Message-----
>From: Mohamed Mustafa Rafik Khimani <[email protected]>
>Date: Wednesday, February 19, 2014 12:56 PM
>To: Chris Mattmann <[email protected]>
>Subject: CSCI ASSIGNMENT QUESTION
>
>>Hello Professor Mattmann,
>>I have a doubt regarding the Tika assignment. I was trying to read one of
>>the PDF files downloaded from the vault. I was unable to read the file
>>using the Tika class and the parse method, which was returning null for
>>each line.
>>
>>When I tried to use the detect method to check the MIME type of the
>>file, it returned audio/mpeg.
>>
>>I tried using one of the known PDF files, which returned the correct MIME
>>type and was parsed correctly.
>>
>>I wanted to confirm whether I need to pre-process the file in any way
>>before I can extract the contents, or whether there might be an issue with
>>the PDF files that I have downloaded, and maybe consider re-downloading
>>them?
>>
>>I am following the Tika in Action book. I have read the first 4 chapters
>>and will be reading the content extraction chapter next. I was trying a
>>few things while reading the text, so thought of asking you if this is
>>expected or if I am going wrong somewhere.
>>
>>Thank you for your time.
>>
>>Sincerely,
>>
>>Mohamed Mustafa Khimani
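
One possible way to approach the optional "check OCR quality" step discussed in this thread is a simple character-level heuristic on the extracted text. The sketch below is only an illustration under stated assumptions: the class TextQuality, its methods, the punctuation set, and any threshold are hypothetical and are not part of Tika or of the assignment. It measures the fraction of characters that are letters, digits, whitespace, or common punctuation, on the assumption that badly extracted PDF text tends to contain many replacement or control characters.

```java
// Hypothetical helper, not a Tika API: a rough plausibility check
// for text extracted from a PDF.
final class TextQuality {

    // Assumed set of punctuation that is normal in English prose.
    private static final String COMMON_PUNCTUATION = ".,;:!?'\"()[]-/%&$#@";

    private TextQuality() {
    }

    // Returns the fraction (0.0 to 1.0) of code points that are letters,
    // digits, whitespace, or common punctuation. Empty or null text
    // yields 0.0, since there is nothing readable in it.
    static double plausibleRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int plausible = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (Character.isLetterOrDigit(cp)
                    || Character.isWhitespace(cp)
                    || COMMON_PUNCTUATION.indexOf(cp) >= 0) {
                plausible++;
            }
        }
        return (double) plausible / total;
    }

    // True when the plausible-character ratio reaches the given threshold.
    // The threshold is a tuning choice left to the caller.
    static boolean looksReadable(String text, double threshold) {
        return plausibleRatio(text) >= threshold;
    }
}
```

With the code quoted above, this could be called on the string returned by handler.toString() before the keyword loop, for example skipping or logging a file when TextQuality.looksReadable(content, 0.9) is false. The 0.9 value is an arbitrary starting point and would need tuning against the actual PDF files.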
