Re: Performance Question with CTSheetDataImpl.java

David Fisher Sat, 29 May 2010 09:38:03 -0700

On May 29, 2010, at 4:21 AM, Yegor Kozlov wrote:

> On 5/27/10 12:49 AM, [email protected] wrote:
>> Couple of thoughts:
>> 
>> 1.  I did a test with "real" data in a controlled environment, and
>> Identified that in the real world condition this change did help
>> significantly for large sheets that did not require re-ordering.  12 Min
>> processing time dropped to 4ish Minutes.
>> 
>>   
> 
> I made a similar improvement in XSSFRow#onDocumentWrite(), this method always 
> re-assigned the array of CTCell even if the cells were ordered.
> The performance boost is not so big as we got from the optimization of 
> XSSFSheet#write. The number of cells is usually small and the call of 
> CTRow#setCArray is not that expensive, the improvement should be significant 
> with 'wide' sheets having more than 100 columns.
> 
> The fix was committed in r949375
>> 2.  I have been giving it a bit more thought, and would like to through
>> the idea of "block" re-order comparison for a further improvements where
>> the data does not match.  The basic idea is that because the rows are
>> maintained in a linked list in xml beans, can we on the poi side of
>> things look at only comparing blocks of data and doing some of the list
>> manipulation as needed, but skip blocks where we know or can predict
>> that the underlying list is ordered correctly.
>> 
>> I am just floating this idea to the list.  I will not be able to dig
>> into it for a bit...
>> I would love to hear thoughts?
>> 
>>   
> I don't quite understand. Are you suggesting to manipulate low-level DOM 
> objects and swap only those pieces that are not sorted?
> In theory you can do that using XmlCursor but the code will be tricky and 
> harder to maintain. Also, the performance gain is not evident to me.
> 
> An alternative approach is to improve XSSFSheet#createRow so that the array 
> of CTRow beans is kept in a sorted state. Current implementation appends new 
> rows, this is why we need the trick with re-ordering to ensure that the 
> resulting XML is valid. If we do this optimization then the re-ordering trick 
> is extra and can be removed. The performance of XSSFSheet#createRow can be 
> slower (random insertions in LinkedList!) but the performance of 
> XSSFSheet#write will be definitely faster.


Why a Single LinkedList? If you do use one a linked list - then try a 
Double-Linked and insert from the end. This will make in order insertions O(1) 
- instead of O(n).

Regards,
Dave




> 
> Yegor
>> Bryce Alcock * Performance Architect * SunGard * Asset Arena *
>> 377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
>> www.sungard.com/assetarena
>> 
>> Think before you print
>> 
>> 
>> 
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> Sent: Monday, May 24, 2010 10:06 AM
>> To: [email protected]; [email protected]
>> Subject: RE: Performance Question with CTSheetDataImpl.java
>> 
>> The performance improvement is significant for my simple "Laboratory"
>> test.
>> See attached PNG file showing Yegor's code vs. Poi before the change.
>> 
>> 
>> Yegor's Code Changes Inplace:
>> Note that the AVG Time per Row does not change (O(1) per row leads to
>> O(n) over all)
>> 
>> File Name            Rows    Total   AVG
>>                              Time    Mills/
>>                              Millis  Row     
>> ./20000_true.xls     20000   1964    0.1     
>> ./24000_true.xls     24000   2019    0.08    
>> ./28800_true.xls     28800   2624    0.09    
>> ./34560_true.xls     34560   3012    0.09    
>> ./41472_true.xls     41472   3409    0.08    
>> ./49766_true.xls     49766   4617    0.09    
>> ./59719_true.xls     59719   6162    0.1     
>> ./71662_true.xls     71662   5705    0.08    
>> ./85994_true.xls     85994   7487    0.09    
>> ./103192_true.xls    103192  9060    0.09    
>> ./123830_true.xls    123830  11875   0.1     
>> ./148596_true.xls    148596  13591   0.09    
>> 
>> 
>> Original Poi (pre row order check)
>> Notice that the AVG Mills/Row increases linearly with rows
>> (O(n) which leads to O(n^2) as seen in the Total Time)
>> 
>> File Name            Rows    Total   AVG
>>                              Time    Mills/
>>                              Millis  Row     
>> ./20000_true.xls     20000   4655    0.23
>> ./24000_true.xls     24000   6398    0.27
>> ./28800_true.xls     28800   7626    0.26
>> ./34560_true.xls     34560   10178   0.29
>> ./41472_true.xls     41472   17647   0.43
>> ./49766_true.xls     49766   24645   0.5
>> ./59719_true.xls     59719   38765   0.65
>> ./71662_true.xls     71662   63009   0.88
>> ./85994_true.xls     85994   98448   1.14
>> ./103192_true.xls    103192  144313  1.4
>> ./123830_true.xls    123830  217893  1.76
>> ./148596_true.xls    148596  320606  2.16
>> 
>> 
>> 
>> Thanks for your effort's Yegor.
>> I will test this in the "wild" with real data soon.
>> 
>> 
>> Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
>> Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
>> Confidential Fax 630-515-1908 * www.sungard.com/assetarena
>> 
>> 
>> 
>> -----Original Message-----
>> From: Yegor Kozlov [mailto:[email protected]]
>> Sent: Mon 5/24/2010 12:58 AM
>> To: POI Developers List
>> Cc: Alcock, Bryce
>> Subject: Re: Performance Question with CTSheetDataImpl.java
>> 
>> the purpose of calling CTSheetData#setRowArray is to ensure that the
>> array of CTRow beans in CTSheetData is ordered. The point is that we
>> don't always need this operation, in many cases rows are already ordered
>> and re-assigning them to CTSheetData is an unnecessary expensive
>> operation.
>> 
>> Compare two use cases:
>> 
>> case 1: rows are written in increasing order
>> 
>> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
>> sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
>> to re-order rows before writing
>> 
>> 
>> case 2: row are created in random order
>> 
>> XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
>> sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
>> re-order rows before writing
>> 
>> In the first case the call of CTSheetData#setRowArray is extra because
>> the state of CTSheetData match the logical model. In the second case we
>> do need to call CTSheetData#setRowArray to ensure that rows are written
>> in ascending order.
>> 
>> So, the performance of XSSFSheet#write depends on how rows were added:
>> if rows are added in strict ascending order then XSSFSheet#write is
>> fast. If the order of rows is random or rows were shifted then
>> XSSFSheet#write involves CTSheetData#setRowArray and becomes more
>> expensive.
>> 
>> I committed this improvement in r947542. Please try the latest build
>> from trunk.
>> 
>> Below are two my tests, the only difference is that in the first test
>> rows are written in ascending order and in the second test in
>> descending.
>> 
>>      public static void main(String[] args) throws Exception {
>>          long t1 = System.currentTimeMillis();
>> 
>>          XSSFWorkbook wb = new XSSFWorkbook();
>>          XSSFSheet sh = wb.createSheet();
>>          for (int i = 0; i<  Short.MAX_VALUE; i++) {
>>              XSSFRow row = sh.createRow(i);
>>              for (int j = 0; j<  5; j++) {
>>                  XSSFCell cell = row.createCell(j);
>>                  cell.setCellValue(i*j);
>>              }
>>          }
>>          long t2 = System.currentTimeMillis();
>>          System.out.println("generate: " + (t2-t1) + " ms");
>> 
>>          FileOutputStream out = new FileOutputStream(new
>> File("/temp/rows.xlsx"));
>>          wb.write(out);
>>          out.close();
>>          long t3 = System.currentTimeMillis();
>>          System.out.println("write: " + (t3-t2) + " ms");
>>      }
>> 
>> 
>> generate: 5488 ms
>> write: 2507 ms
>> 
>>      public static void main(String[] args) throws Exception {
>>          long t1 = System.currentTimeMillis();
>> 
>>          XSSFWorkbook wb = new XSSFWorkbook();
>>          XSSFSheet sh = wb.createSheet();
>>          for (int i = Short.MAX_VALUE; i>  0; i--) {
>>              XSSFRow row = sh.createRow(i);
>>              for (int j = 0; j<  5; j++) {
>>                  XSSFCell cell = row.createCell(j);
>>                  cell.setCellValue(i*j);
>>              }
>>          }
>>          long t2 = System.currentTimeMillis();
>>          System.out.println("generate: " + (t2-t1) + " ms");
>> 
>>          FileOutputStream out = new FileOutputStream(new
>> File("/temp/rows.xlsx"));
>>          wb.write(out); //will involve CTSheetData#setRowArray
>>          out.close();
>>          long t3 = System.currentTimeMillis();
>>          System.out.println("write: " + (t3-t2) + " ms");
>>      }
>> 
>> generate: 9291 ms
>> write: 94888 ms
>> 
>> The first test is almost 40X faster!
>> 
>> Regards,
>> Yegor
>> 
>> 
>> 
>>   
>>> Continuing to chip away at CTSheetDataImpl's performance issues.
>>> However, the problematic code is unlikely to be changed, I have a
>>>     
>> discussion going on the XMLBeans website, and as it turns out that code
>> is most likely not going to change, or will not change easily.
>>   
>>> 
>>> With that in mind, I started taking a close look at the POI side of
>>>     
>> things and very specifically the problematic code from a POI
>> perspective.
>>   
>>> 
>>> At approximately line 2368 of XSSFSheet.java
>>> 
>>> the following line is the suspect line:
>>> I have modified the code to break 1 line into 3  the slow line is the
>>>     
>> last one.
>>   
>>>          CTRow[] lctRow = new CTRow[rArray.size()];
>>>     lctRow = rArray.toArray(lctRow);
>>>          sheetData.setRowArray(lctRow);
>>> 
>>> 
>>> 
>>> My Question is:  in looking at XSSFSheet.java
>>> are there any assumption that we can make about the state of sheetData
>>>     
>> (A CTSheetData object)
>>   
>>> that would allow us to either overwrite the default
>>>     
>> setRowArray(lctRow) method,
>>   
>>> or side step this call all together by calling a method that is not
>>>     
>> generic, but instead much
>>   
>>> more targeted to the function and process that is currently going on.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Here is the full call in context...Take from my modified version of
>>>     
>> XSSFSheet.java
>>   
>>> 
>>>      protected void write(OutputStream out) throws IOException {
>>> 
>>>          if(worksheet.getColsArray().length == 1) {
>>>              CTCols col = worksheet.getColsArray(0);
>>>              CTCol[] cols = col.getColArray();
>>>              if(cols.length == 0) {
>>>                  worksheet.setColsArray(null);
>>>              }
>>>          }
>>> 
>>>          // Now re-generate our CTHyperlinks, if needed
>>>          if(hyperlinks.size()>   0) {
>>>              if(worksheet.getHyperlinks() == null) {
>>>                  worksheet.addNewHyperlinks();
>>>              }
>>>              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
>>>              for(int i=0; i<ctHls.length; i++) {
>>>                  // If our sheet has hyperlinks, have them add
>>>                  //  any relationships that they might need
>>>                  XSSFHyperlink hyperlink = hyperlinks.get(i);
>>>                  hyperlink.generateRelationIfNeeded(getPackagePart());
>>>                  // Now grab their underling object
>>>                  ctHls[i] = hyperlink.getCTHyperlink();
>>>              }
>>>              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
>>>          }
>>> 
>>>          CTSheetData sheetData = worksheet.getSheetData();
>>>          ArrayList<CTRow>   rArray = new ArrayList<CTRow>(rows.size());
>>>          for(XSSFRow row : rows.values()){
>>>              row.onDocumentWrite();
>>>              rArray.add(row.getCTRow());
>>>          }
>>>             long startTimeSheetData = System.currentTimeMillis();
>>>             CTRow[] lctRow = new CTRow[rArray.size()];
>>>             lctRow = rArray.toArray(lctRow);
>>>          sheetData.setRowArray(lctRow);
>>>             System.out.println("Sheet Time : " +
>>>     
>> (System.currentTimeMillis() - startTimeSheetData));
>>   
>>> 
>>>          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
>>>          xmlOptions.setSaveSyntheticDocumentElement(new
>>>     
>> QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));
>>   
>>>          Map<String, String>   map = new HashMap<String, String>();
>>>          map.put(STRelationshipId.type.getName().getNamespaceURI(),
>>>     
>> "r");
>>   
>>>          xmlOptions.setSaveSuggestedPrefixes(map);
>>>             long startTimeWorkSheetSave =
>>>     
>> System.currentTimeMillis();
>>   
>>>          worksheet.save(out, xmlOptions);
>>>             System.out.println("Worksheet save Time : " +
>>>     
>> (System.currentTimeMillis() - startTimeWorkSheetSave));
>>   
>>>      }
>>> 
>>> 
>>> 
>>> Best Regards
>>> Bryce Alcock
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>>> 
>>> 
>>>     
>> 
>> 
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 
>>   
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Performance Question with CTSheetDataImpl.java

Reply via email to