[jira] [Commented] (NUTCH-1206) tika parser of nutch 1.3 is failing to process pdfs

dibyendu ghosh (Commented) (JIRA) Fri, 02 Dec 2011 02:02:09 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161520#comment-13161520 ]


dibyendu ghosh commented on NUTCH-1206:
---------------------------------------

Output of my original test with 1.4:
=======================
bash-2.00$ java TestParse direct.pdf
Converting direct.pdf to html.
All parsing attempts failed
bash-2.00$ cat hadoop.log
2011-12-02 15:39:15,356 INFO  plugin.PluginRepository - Plugins: looking in: /space/dibyendu/nutch/1.4/runtime/local/plugins
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2011-12-02 15:39:15,611 INFO  plugin.PluginRepository - Registered Plugins:
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Http / Https Protocol Plug-in (protocol-httpclient)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2011-12-02 15:39:15,612 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository - Registered Extension-Points:
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2011-12-02 15:39:15,613 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2011-12-02 15:39:15,614 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2011-12-02 15:39:16,794 WARN  parse.ParseUtil - Unable to successfully parse content file:direct.pdf of type application/pdf
2011-12-02 15:39:16,885 WARN  parse.ParseResult - file:direct.pdf is not parsed successfully, filtering
bash-2.00$ echo $CLASSPATH
conf:lib/nutch-1.4.jar:lib/log4j-1.2.15.jar:lib/commons-logging-1.1.1.jar:lib/hadoop-core-0.20.2.jar:lib/oro-2.0.8.jar:lib/tika-core-0.10.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:.
=======================
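
A side note on the test program quoted below: it sizes its buffer with in.available() and then calls in.read(buf) once. Neither call is guaranteed to cover the whole file, so the parser could in principle be handed a truncated PDF. I have not confirmed that this happens here or that it is related to the failure. The following is only a minimal sketch of a safer read, assuming Java 7 or later; the class name FileBytes and the method readFully are made up for illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch or of the original TestParse.java.
final class FileBytes {
    private FileBytes() {}

    /** Reads the whole file into memory, or throws IOException. */
    static byte[] readFully(String fileName) throws IOException {
        Path path = Paths.get(fileName);
        // Files.readAllBytes keeps reading until end of file, unlike a
        // single InputStream.read(buf) call, which may return early.
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = readFully(args[0]);
        System.out.println(args[0] + ": " + buf.length + " bytes read");
    }
}
```

The returned array could then be passed to the Content constructor in place of buf.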
                
> tika parser of nutch 1.3 is failing to process pdfs
> ---------------------------------------------------
>
>                 Key: NUTCH-1206
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1206
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>         Environment: Solaris/Linux/Windows
>            Reporter: dibyendu ghosh
>            Assignee: Chris A. Mattmann
>         Attachments: direct.pdf
>
>
> Please refer to this message:
> http://www.mail-archive.com/user%40nutch.apache.org/msg04315.html
> The old parse-pdf parser (checked with Nutch 1.2) can parse older PDFs,
> although it cannot parse Acrobat 9.0 PDFs. Nutch 1.3 does not have the
> parse-pdf plugin, and it cannot parse even the older PDFs.
> My code (TestParse.java):
> ----------------------------
> bash-2.00$ cat TestParse.java
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.PrintStream;
> import java.util.Iterator;
> import java.util.Map;
> import java.util.Map.Entry;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseStatus;
> import org.apache.nutch.parse.ParseUtil;
> import org.apache.nutch.parse.ParseData;
> import org.apache.nutch.protocol.Content;
> import org.apache.nutch.util.NutchConfiguration;
> public class TestParse {
>     private static Configuration conf = NutchConfiguration.create();
>     public TestParse() {
>     }
>     public static void main(String[] args) {
>         String filename = args[0];
>         convert(filename);
>     }
>     public static String convert(String fileName) {
>         String newName = "abc.html";
>         try {
>             System.out.println("Converting " + fileName + " to html.");
>             if (convertToHtml(fileName, newName))
>                 return newName;
>         } catch (Exception e) {
>             (new File(newName)).delete();
>             System.out.println("General exception " + e.getMessage());
>         }
>         return null;
>     }
>     private static boolean convertToHtml(String fileName, String newName)
>         throws Exception {
>         // Read the file
>         FileInputStream in = new FileInputStream(fileName);
>         byte[] buf = new byte[in.available()];
>         in.read(buf);
>         in.close();
>         // Parse the file
>         Content content = new Content("file:" + fileName, "file:" + fileName,
>                                       buf, "", new Metadata(), conf);
>         ParseResult parseResult = new ParseUtil(conf).parse(content);
>         parseResult.filter();
>         if (parseResult.isEmpty()) {
>             System.out.println("All parsing attempts failed");
>             return false;
>         }
>         Iterator<Map.Entry<Text,Parse>> iterator = parseResult.iterator();
>         if (iterator == null) {
>             System.out.println("Cannot iterate over successful parse results");
>             return false;
>         }
>         Parse parse = null;
>         ParseData parseData = null;
>         while (iterator.hasNext()) {
>             parse = parseResult.get((Text)iterator.next().getKey());
>             parseData = parse.getData();
>             ParseStatus status = parseData.getStatus();
>             // If Parse failed then bail
>             if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
>                 System.out.println("Could not parse " + fileName + ". " +
>                             status.getMessage());
>                 return false;
>             }
>         }
>         // Start writing to newName
>         FileOutputStream fout = new FileOutputStream(newName);
>         PrintStream out = new PrintStream(fout, true, "UTF-8");
>         // Start Document
>         out.println("<html>");
>         // Start Header
>         out.println("<head>");
>         // Write Title
>         String title = parseData.getTitle();
>         if (title != null && title.trim().length() > 0) {
>             out.println("<title>" + parseData.getTitle() + "</title>");
>         }
>         // Write out Meta tags
>         Metadata metaData = parseData.getContentMeta();
>         String[] names = metaData.names();
>         for (String name : names) {
>             String[] subvalues = metaData.getValues(name);
>             String values = ""; // start empty; a null start would prepend "null"
>             for (String subvalue : subvalues) {
>                 values += subvalue;
>             }
>             if (values.length() > 0)
>                 out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
>                            name, values);
>         }
>         out.println("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"/>");
>         // End Meta tags
>         out.println("</head>"); // End Header
>         // Start Body
>         out.println("<body>");
>         out.print(parse.getText());
>         out.println("</body>"); // End Body
>         out.println("</html>"); // End Document
>         out.close(); // Close the file
>         return true;
>     }
> }
> ----------------------------
> command:
> ======
> bash-2.00$ java -classpath conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:. TestParse direct.pdf
> ======
> output:
> _____
> Converting direct.pdf to html.
> Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource tika-mimetypes.xml at file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser parsePluginFolder
> INFO: Plugins: looking in: /path/to/nutch/1.3/runtime/local/plugins
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Plugin Auto-activation mode: [true]
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Plugins:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   the nutch core extension points (nutch-extensionpoints)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Tika Parser Plug-in (parse-tika)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO: Registered Extension-Points:
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Protocol (org.apache.nutch.protocol.Protocol)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch URL Filter (org.apache.nutch.net.URLFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Content Parser (org.apache.nutch.parse.Parser)
> Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginRepository displayStatus
> INFO:   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> Oct 19, 2011 5:05:20 PM org.apache.hadoop.conf.Configuration getConfResourceAsInputStream
> INFO: found resource parse-plugins.xml at file:/path/to/nutch/1.3/conf/parse-plugins.xml
> Oct 19, 2011 5:05:20 PM org.apache.nutch.parse.ParserFactory matchExtensions
> INFO: The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it  in the parse-plugins.xml file
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseUtil parse
> WARNING: Unable to successfully parse content file:direct.pdf of type application/pdf
> Oct 19, 2011 5:05:21 PM org.apache.nutch.parse.ParseResult filter
> WARNING: file:direct.pdf is not parsed successfully, filtering
> All parsing attempts failed
> _____
> my customized nutch-site.xml:
> ~~~~~~~~~~~~~~~~~~~~
> bash-2.00$ cat conf/nutch-site.xml
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>   <property>
>     <name>plugin.folders</name>
>     <value>runtime/local/plugins</value>
>     <description>Directories where nutch plugins are located.  Each
>     element may be a relative or absolute path.  If absolute, it is used
>     as is.  If relative, it is searched for on the classpath.</description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>parse-tika</value>
>     <description>Regular expression naming plugin directory names to
>     include. Any plugin not matching this expression is excluded.
>     </description>
>   </property>
> </configuration>
> ~~~~~~~~~~~~~~~~~~~~
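
One detail in the meta-tag loop of TestParse.java that is independent of the PDF failure: when the values of a multi-valued metadata field are concatenated, the accumulator has to start out empty, because in Java appending to a null String yields text beginning with the literal "null". The following is a rough sketch of the join using a StringBuilder. MetaValues and join are hypothetical names used only for illustration, and the sample values are made up.

```java
// Hypothetical helper illustrating the join; not a Nutch API.
final class MetaValues {
    private MetaValues() {}

    /** Concatenates all non-null values of one metadata field. */
    static String join(String[] subvalues) {
        StringBuilder values = new StringBuilder();
        for (String subvalue : subvalues) {
            if (subvalue != null) {
                values.append(subvalue);
            }
        }
        return values.toString();
    }

    public static void main(String[] args) {
        String[] sample = {"application/", "pdf"}; // made-up sample values

        // Accumulating from null turns the null into the text "null".
        String broken = null;
        for (String s : sample) {
            broken += s;
        }
        System.out.println(broken);        // prints "nullapplication/pdf"
        System.out.println(join(sample));  // prints "application/pdf"
    }
}
```

An empty result from join means the field has no usable values, so the corresponding meta tag can be skipped.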

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
