On May 29, 2010, at 4:21 AM, Yegor Kozlov wrote: > On 5/27/10 12:49 AM, [email protected] wrote: >> Couple of thoughts: >> >> 1. I did a test with "real" data in a controlled environment, and >> Identified that in the real world condition this change did help >> significantly for large sheets that did not require re-ordering. 12 Min >> processing time dropped to 4ish Minutes. >> >> > > I made a similar improvement in XSSFRow#onDocumentWrite(), this method always > re-assigned the array of CTCell even if the cells were ordered. > The performance boost is not so big as we got from the optimization of > XSSFSheet#write. The number of cells is usually small and the call of > CTRow#setCArray is not that expensive, the improvement should be significant > with 'wide' sheets having more than 100 columns. > > The fix was committed in r949375 >> 2. I have been giving it a bit more thought, and would like to through >> the idea of "block" re-order comparison for a further improvements where >> the data does not match. The basic idea is that because the rows are >> maintained in a linked list in xml beans, can we on the poi side of >> things look at only comparing blocks of data and doing some of the list >> manipulation as needed, but skip blocks where we know or can predict >> that the underlying list is ordered correctly. >> >> I am just floating this idea to the list. I will not be able to dig >> into it for a bit... >> I would love to hear thoughts? >> >> > I don't quite understand. Are you suggesting to manipulate low-level DOM > objects and swap only those pieces that are not sorted? > In theory you can do that using XmlCursor but the code will be tricky and > harder to maintain. Also, the performance gain is not evident to me. > > An alternative approach is to improve XSSFSheet#createRow so that the array > of CTRow beans is kept in a sorted state. Current implementation appends new > rows, this is why we need the trick with re-ordering to ensure that the > resulting XML is valid. If we do this optimization then the re-ordering trick > is extra and can be removed. The performance of XSSFSheet#createRow can be > slower (random insertions in LinkedList!) but the performance of > XSSFSheet#write will be definitely faster.
Why a Single LinkedList? If you do use one a linked list - then try a Double-Linked and insert from the end. This will make in order insertions O(1) - instead of O(n). Regards, Dave > > Yegor >> Bryce Alcock * Performance Architect * SunGard * Asset Arena * >> 377 E. Butterfield Road Suite 800, Lombard, IL 60148 * >> www.sungard.com/assetarena >> >> Think before you print >> >> >> >> -----Original Message----- >> From: [email protected] [mailto:[email protected]] >> Sent: Monday, May 24, 2010 10:06 AM >> To: [email protected]; [email protected] >> Subject: RE: Performance Question with CTSheetDataImpl.java >> >> The performance improvement is significant for my simple "Laboratory" >> test. >> See attached PNG file showing Yegor's code vs. Poi before the change. >> >> >> Yegor's Code Changes Inplace: >> Note that the AVG Time per Row does not change (O(1) per row leads to >> O(n) over all) >> >> File Name Rows Total AVG >> Time Mills/ >> Millis Row >> ./20000_true.xls 20000 1964 0.1 >> ./24000_true.xls 24000 2019 0.08 >> ./28800_true.xls 28800 2624 0.09 >> ./34560_true.xls 34560 3012 0.09 >> ./41472_true.xls 41472 3409 0.08 >> ./49766_true.xls 49766 4617 0.09 >> ./59719_true.xls 59719 6162 0.1 >> ./71662_true.xls 71662 5705 0.08 >> ./85994_true.xls 85994 7487 0.09 >> ./103192_true.xls 103192 9060 0.09 >> ./123830_true.xls 123830 11875 0.1 >> ./148596_true.xls 148596 13591 0.09 >> >> >> Original Poi (pre row order check) >> Notice that the AVG Mills/Row increases linearly with rows >> (O(n) which leads to O(n^2) as seen in the Total Time) >> >> File Name Rows Total AVG >> Time Mills/ >> Millis Row >> ./20000_true.xls 20000 4655 0.23 >> ./24000_true.xls 24000 6398 0.27 >> ./28800_true.xls 28800 7626 0.26 >> ./34560_true.xls 34560 10178 0.29 >> ./41472_true.xls 41472 17647 0.43 >> ./49766_true.xls 49766 24645 0.5 >> ./59719_true.xls 59719 38765 0.65 >> ./71662_true.xls 71662 63009 0.88 >> ./85994_true.xls 85994 98448 1.14 >> ./103192_true.xls 103192 144313 1.4 >> ./123830_true.xls 123830 217893 1.76 >> ./148596_true.xls 148596 320606 2.16 >> >> >> >> Thanks for your effort's Yegor. >> I will test this in the "wild" with real data soon. >> >> >> Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E. >> Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 * >> Confidential Fax 630-515-1908 * www.sungard.com/assetarena >> >> >> >> -----Original Message----- >> From: Yegor Kozlov [mailto:[email protected]] >> Sent: Mon 5/24/2010 12:58 AM >> To: POI Developers List >> Cc: Alcock, Bryce >> Subject: Re: Performance Question with CTSheetDataImpl.java >> >> the purpose of calling CTSheetData#setRowArray is to ensure that the >> array of CTRow beans in CTSheetData is ordered. The point is that we >> don't always need this operation, in many cases rows are already ordered >> and re-assigning them to CTSheetData is an unnecessary expensive >> operation. >> >> Compare two use cases: >> >> case 1: rows are written in increasing order >> >> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1); >> sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need >> to re-order rows before writing >> >> >> case 2: row are created in random order >> >> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3); >> sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to >> re-order rows before writing >> >> In the first case the call of CTSheetData#setRowArray is extra because >> the state of CTSheetData match the logical model. In the second case we >> do need to call CTSheetData#setRowArray to ensure that rows are written >> in ascending order. >> >> So, the performance of XSSFSheet#write depends on how rows were added: >> if rows are added in strict ascending order then XSSFSheet#write is >> fast. If the order of rows is random or rows were shifted then >> XSSFSheet#write involves CTSheetData#setRowArray and becomes more >> expensive. >> >> I committed this improvement in r947542. Please try the latest build >> from trunk. >> >> Below are two my tests, the only difference is that in the first test >> rows are written in ascending order and in the second test in >> descending. >> >> public static void main(String[] args) throws Exception { >> long t1 = System.currentTimeMillis(); >> >> XSSFWorkbook wb = new XSSFWorkbook(); >> XSSFSheet sh = wb.createSheet(); >> for (int i = 0; i< Short.MAX_VALUE; i++) { >> XSSFRow row = sh.createRow(i); >> for (int j = 0; j< 5; j++) { >> XSSFCell cell = row.createCell(j); >> cell.setCellValue(i*j); >> } >> } >> long t2 = System.currentTimeMillis(); >> System.out.println("generate: " + (t2-t1) + " ms"); >> >> FileOutputStream out = new FileOutputStream(new >> File("/temp/rows.xlsx")); >> wb.write(out); >> out.close(); >> long t3 = System.currentTimeMillis(); >> System.out.println("write: " + (t3-t2) + " ms"); >> } >> >> >> generate: 5488 ms >> write: 2507 ms >> >> public static void main(String[] args) throws Exception { >> long t1 = System.currentTimeMillis(); >> >> XSSFWorkbook wb = new XSSFWorkbook(); >> XSSFSheet sh = wb.createSheet(); >> for (int i = Short.MAX_VALUE; i> 0; i--) { >> XSSFRow row = sh.createRow(i); >> for (int j = 0; j< 5; j++) { >> XSSFCell cell = row.createCell(j); >> cell.setCellValue(i*j); >> } >> } >> long t2 = System.currentTimeMillis(); >> System.out.println("generate: " + (t2-t1) + " ms"); >> >> FileOutputStream out = new FileOutputStream(new >> File("/temp/rows.xlsx")); >> wb.write(out); //will involve CTSheetData#setRowArray >> out.close(); >> long t3 = System.currentTimeMillis(); >> System.out.println("write: " + (t3-t2) + " ms"); >> } >> >> generate: 9291 ms >> write: 94888 ms >> >> The first test is almost 40X faster! >> >> Regards, >> Yegor >> >> >> >> >>> Continuing to chip away at CTSheetDataImpl's performance issues. >>> However, the problematic code is unlikely to be changed, I have a >>> >> discussion going on the XMLBeans website, and as it turns out that code >> is most likely not going to change, or will not change easily. >> >>> >>> With that in mind, I started taking a close look at the POI side of >>> >> things and very specifically the problematic code from a POI >> perspective. >> >>> >>> At approximately line 2368 of XSSFSheet.java >>> >>> the following line is the suspect line: >>> I have modified the code to break 1 line into 3 the slow line is the >>> >> last one. >> >>> CTRow[] lctRow = new CTRow[rArray.size()]; >>> lctRow = rArray.toArray(lctRow); >>> sheetData.setRowArray(lctRow); >>> >>> >>> >>> My Question is: in looking at XSSFSheet.java >>> are there any assumption that we can make about the state of sheetData >>> >> (A CTSheetData object) >> >>> that would allow us to either overwrite the default >>> >> setRowArray(lctRow) method, >> >>> or side step this call all together by calling a method that is not >>> >> generic, but instead much >> >>> more targeted to the function and process that is currently going on. >>> >>> >>> >>> >>> >>> >>> >>> >>> Here is the full call in context...Take from my modified version of >>> >> XSSFSheet.java >> >>> >>> protected void write(OutputStream out) throws IOException { >>> >>> if(worksheet.getColsArray().length == 1) { >>> CTCols col = worksheet.getColsArray(0); >>> CTCol[] cols = col.getColArray(); >>> if(cols.length == 0) { >>> worksheet.setColsArray(null); >>> } >>> } >>> >>> // Now re-generate our CTHyperlinks, if needed >>> if(hyperlinks.size()> 0) { >>> if(worksheet.getHyperlinks() == null) { >>> worksheet.addNewHyperlinks(); >>> } >>> CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()]; >>> for(int i=0; i<ctHls.length; i++) { >>> // If our sheet has hyperlinks, have them add >>> // any relationships that they might need >>> XSSFHyperlink hyperlink = hyperlinks.get(i); >>> hyperlink.generateRelationIfNeeded(getPackagePart()); >>> // Now grab their underling object >>> ctHls[i] = hyperlink.getCTHyperlink(); >>> } >>> worksheet.getHyperlinks().setHyperlinkArray(ctHls); >>> } >>> >>> CTSheetData sheetData = worksheet.getSheetData(); >>> ArrayList<CTRow> rArray = new ArrayList<CTRow>(rows.size()); >>> for(XSSFRow row : rows.values()){ >>> row.onDocumentWrite(); >>> rArray.add(row.getCTRow()); >>> } >>> long startTimeSheetData = System.currentTimeMillis(); >>> CTRow[] lctRow = new CTRow[rArray.size()]; >>> lctRow = rArray.toArray(lctRow); >>> sheetData.setRowArray(lctRow); >>> System.out.println("Sheet Time : " + >>> >> (System.currentTimeMillis() - startTimeSheetData)); >> >>> >>> XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS); >>> xmlOptions.setSaveSyntheticDocumentElement(new >>> >> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet")); >> >>> Map<String, String> map = new HashMap<String, String>(); >>> map.put(STRelationshipId.type.getName().getNamespaceURI(), >>> >> "r"); >> >>> xmlOptions.setSaveSuggestedPrefixes(map); >>> long startTimeWorkSheetSave = >>> >> System.currentTimeMillis(); >> >>> worksheet.save(out, xmlOptions); >>> System.out.println("Worksheet save Time : " + >>> >> (System.currentTimeMillis() - startTimeWorkSheetSave)); >> >>> } >>> >>> >>> >>> Best Regards >>> Bryce Alcock >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: [email protected] >>> For additional commands, e-mail: [email protected] >>> >>> >>> >>> >> >> >> >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
