[ https://issues.apache.org/jira/browse/NUTCH-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495229 ]
Sami Siren commented on NUTCH-472:
----------------------------------

> Not sure how to turn source code in description into a patch file, but the
> fixed "extractText" method was included earlier.

You can follow the instructions on how to create patches on the Nutch wiki:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

> NullPointerException in ZipTextExtractor if no MIME type for zipped file
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-472
>                 URL: https://issues.apache.org/jira/browse/NUTCH-472
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>         Environment: Any
>            Reporter: Antony Bowesman
>
> extractText throws a NPE in
>   String contentType = MIME.getMimeType(fname).getName();
> if the file in the zip has no configured mime type which breaks the parsing
> of the zip.
> Code should do:
>   public String extractText(InputStream input, String url, List outLinksList)
>       throws IOException {
>     String resultText = "";
>     byte temp;
>
>     ZipInputStream zin = new ZipInputStream(input);
>
>     ZipEntry entry;
>
>     while ((entry = zin.getNextEntry()) != null) {
>
>       if (!entry.isDirectory()) {
>         int size = (int) entry.getSize();
>         byte[] b = new byte[size];
>         for(int x = 0; x < size; x++) {
>           int err = zin.read();
>           if(err != -1) {
>             b[x] = (byte)err;
>           }
>         }
>         String newurl = url + "/";
>         String fname = entry.getName();
>         newurl += fname;
>         URL aURL = new URL(newurl);
>         String base = aURL.toString();
>         int i = fname.lastIndexOf('.');
>         if (i != -1) {
>           // Trying to resolve the Mime-Type
>           MimeType mt = MIME.getMimeType(fname);
>           if (mt != null) {
>             String contentType = mt.getName();
>             try {
>               Metadata metadata = new Metadata();
>               metadata.set(Response.CONTENT_LENGTH,
>                   Long.toString(entry.getSize()));
>               metadata.set(Response.CONTENT_TYPE, contentType);
>               Content content = new Content(newurl, base, b, contentType,
>                   metadata, this.conf);
>               Parse parse = new ParseUtil(this.conf).parse(content);
>               ParseData theParseData = parse.getData();
>               Outlink[] theOutlinks = theParseData.getOutlinks();
>
>               for(int count = 0; count < theOutlinks.length; count++) {
>                 outLinksList.add(new Outlink(theOutlinks[count].getToUrl(),
>                     theOutlinks[count].getAnchor(), this.conf));
>               }
>
>               resultText += entry.getName() + " " + parse.getText() + " ";
>             } catch (ParseException e) {
>               if (LOG.isInfoEnabled()) {
>                 LOG.info("fetch okay, but can't parse " + fname + ", reason: "
>                     + e.getMessage());
>               }
>             }
>           } else {
>             resultText += entry.getName();
>           }
>         }
>       }
>     }
>
>     return resultText;
>   }
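
For readers who want to see the null guard from the issue in isolation, here is a minimal standalone sketch. It does not use Nutch's classes: MimeGuardSketch, its lookup table, and the describe helper are all hypothetical stand-ins, and the sketch only assumes that a MIME lookup may return null for an unknown extension, as the report describes. Unlike the proposed extractText, it also returns the bare name for files without any extension.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a MIME registry; this is NOT Nutch's MimeTypes API.
class MimeGuardSketch {
    private static final Map<String, String> TYPES_BY_EXTENSION = new HashMap<>();
    static {
        TYPES_BY_EXTENSION.put("txt", "text/plain");
        TYPES_BY_EXTENSION.put("html", "text/html");
    }

    // Returns the MIME type name for a file name, or null when none is configured.
    static String lookup(String fname) {
        int dot = fname.lastIndexOf('.');
        if (dot == -1) {
            return null;
        }
        String extension = fname.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES_BY_EXTENSION.get(extension);
    }

    // Checks the lookup result before dereferencing it, instead of chaining
    // a call directly onto a value that may be null.
    static String describe(String fname) {
        String type = lookup(fname);
        if (type == null) {
            return fname; // no configured type: fall back to the entry name only
        }
        return fname + " [" + type + "]";
    }

    public static void main(String[] args) {
        System.out.println(describe("notes.txt")); // notes.txt [text/plain]
        System.out.println(describe("data.xyz"));  // data.xyz
        System.out.println(describe("README"));    // README
    }
}
```

The point is only the shape of the check: store the lookup result in a local variable, test it for null, and fall back to the entry name when no type is configured, so one unknown file does not abort processing of the rest of the archive.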