ExcelExtractor performance bad due to String concatenation
----------------------------------------------------------

                 Key: NUTCH-473
                 URL: https://issues.apache.org/jira/browse/NUTCH-473
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 0.9.0
         Environment: Tested under Windows, Java 1.5 and 1.6
            Reporter: Antony Bowesman

Using the 0.9 version, ExcelExtractor was still running after 4 hours at 100% CPU while trying to extract the text from a 3MB Excel file containing 26 sheets, half with a matrix of approx 1100 rows x P columns and the others with approx 1000 rows x E columns.

After changing ExcelExtractor to use StringBuffer, the same extraction process took 3 seconds under Java 1.5. The code changes are below. The example uses a 4K buffer per sheet. This was a completely arbitrary choice, but it keeps the number of StringBuffer expansions low for large files without using too much space for small files.

  protected String extractText(InputStream input) throws Exception {

    String resultText = "";
    HSSFWorkbook wb = new HSSFWorkbook(input);
    if (wb == null) {
      return resultText;
    }

    HSSFSheet sheet;
    HSSFRow row;
    HSSFCell cell;
    int sNum = 0;
    int rNum = 0;
    int cNum = 0;

    sNum = wb.getNumberOfSheets();

    // Allow 4K per sheet - seems a reasonable start
    StringBuffer sb = new StringBuffer(4096 * sNum);
    for (int i=0; i<sNum; i++) {
      if ((sheet = wb.getSheetAt(i)) == null) {
        continue;
      }
      rNum = sheet.getLastRowNum();
      for (int j=0; j<=rNum; j++) {
        if ((row = sheet.getRow(j)) == null){
          continue;
        }
        cNum = row.getLastCellNum();

        for (int k=0; k<cNum; k++) {
          if ((cell = row.getCell((short) k)) != null) {
            /*if(HSSFDateUtil.isCellDateFormatted(cell) == true) {
                resultText += cell.getDateCellValue().toString() + " ";
              } else
             */
            if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING) {
              sb.append(cell.getStringCellValue());
              sb.append(' ');
              // resultText += cell.getStringCellValue() + " ";
            } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
              Double d = new Double(cell.getNumericCellValue());
              sb.append(d.toString());
              sb.append(' ');
              // resultText += d.toString() + " ";
            }
            /* else if(cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
                 resultText += cell.getCellFormula() + " ";
               } */
          }
        }
      }
    }
    return sb.toString();
  }
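
For anyone who wants to see the effect in isolation, here is a minimal standalone sketch of the two accumulation strategies, without POI. The class name ConcatSketch, both method names, and the cell count of 20000 are made up for illustration and are not part of Nutch. It uses StringBuilder (the unsynchronized sibling of StringBuffer, available from Java 1.5) rather than the StringBuffer in the patch. Each += on a String copies everything accumulated so far, so the total work grows roughly with the square of the output length, while the builder appends into a growing buffer. Timings depend on the JVM and machine, so none are quoted here.

```java
import java.util.Arrays;

public class ConcatSketch {

    // Mirrors the original approach: every += allocates a new String
    // and copies all text accumulated so far.
    static String joinWithConcat(String[] cells) {
        String result = "";
        for (String cell : cells) {
            result += cell + " ";
        }
        return result;
    }

    // Mirrors the patched approach: append into one growing buffer.
    // The 4096 initial capacity echoes the arbitrary 4K choice in the patch.
    static String joinWithBuilder(String[] cells) {
        StringBuilder sb = new StringBuilder(4096);
        for (String cell : cells) {
            sb.append(cell);
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical workload: 20000 identical short cell values.
        String[] cells = new String[20000];
        Arrays.fill(cells, "cell");

        long t0 = System.nanoTime();
        String viaConcat = joinWithConcat(cells);
        long t1 = System.nanoTime();
        String viaBuilder = joinWithBuilder(cells);
        long t2 = System.nanoTime();

        System.out.println("same output: " + viaConcat.equals(viaBuilder));
        System.out.println("concat  ms: " + (t1 - t0) / 1000000);
        System.out.println("builder ms: " + (t2 - t1) / 1000000);
    }
}
```

Both methods produce the same text, so the swap changes only how the result is built, not what is indexed.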