Re: Performance Question with CTSheetDataImpl.java

Yegor Kozlov Sat, 29 May 2010 04:21:36 -0700

On 5/27/10 12:49 AM, [email protected] wrote:

Couple of thoughts:


1.  I did a test with "real" data in a controlled environment, and
Identified that in the real world condition this change did help
significantly for large sheets that did not require re-ordering.  12 Min
processing time dropped to 4ish Minutes.

I made a similar improvement in XSSFRow#onDocumentWrite(), this methodalways re-assigned the array of CTCell even if the cells were ordered.The performance boost is not so big as we got from the optimization ofXSSFSheet#write. The number of cells is usually small and the call ofCTRow#setCArray is not that expensive, the improvement should besignificant with 'wide' sheets having more than 100 columns.


The fix was committed in r949375

2.  I have been giving it a bit more thought, and would like to through
the idea of "block" re-order comparison for a further improvements where
the data does not match.  The basic idea is that because the rows are
maintained in a linked list in xml beans, can we on the poi side of
things look at only comparing blocks of data and doing some of the list
manipulation as needed, but skip blocks where we know or can predict
that the underlying list is ordered correctly.

I am just floating this idea to the list.  I will not be able to dig
into it for a bit...
I would love to hear thoughts?

I don't quite understand. Are you suggesting to manipulate low-level DOMobjects and swap only those pieces that are not sorted?In theory you can do that using XmlCursor but the code will be trickyand harder to maintain. Also, the performance gain is not evident to me.

An alternative approach is to improve XSSFSheet#createRow so that thearray of CTRow beans is kept in a sorted state. Current implementationappends new rows, this is why we need the trick with re-ordering toensure that the resulting XML is valid. If we do this optimization thenthe re-ordering trick is extra and can be removed. The performance ofXSSFSheet#createRow can be slower (random insertions in LinkedList!) butthe performance of XSSFSheet#write will be definitely faster.


Yegor

Bryce Alcock * Performance Architect * SunGard * Asset Arena *
377 E. Butterfield Road Suite 800, Lombard, IL 60148 *
www.sungard.com/assetarena

Think before you print



-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Monday, May 24, 2010 10:06 AM
To: [email protected]; [email protected]
Subject: RE: Performance Question with CTSheetDataImpl.java

The performance improvement is significant for my simple "Laboratory"
test.
See attached PNG file showing Yegor's code vs. Poi before the change.


Yegor's Code Changes Inplace:
Note that the AVG Time per Row does not change (O(1) per row leads to
O(n) over all)

File Name               Rows    Total   AVG
                                Time    Mills/
                                Millis  Row     
./20000_true.xls        20000   1964    0.1     
./24000_true.xls        24000   2019    0.08    
./28800_true.xls        28800   2624    0.09    
./34560_true.xls        34560   3012    0.09    
./41472_true.xls        41472   3409    0.08    
./49766_true.xls        49766   4617    0.09    
./59719_true.xls        59719   6162    0.1     
./71662_true.xls        71662   5705    0.08    
./85994_true.xls        85994   7487    0.09    
./103192_true.xls       103192  9060    0.09    
./123830_true.xls       123830  11875   0.1     
./148596_true.xls       148596  13591   0.09    


Original Poi (pre row order check)
Notice that the AVG Mills/Row increases linearly with rows
(O(n) which leads to O(n^2) as seen in the Total Time)

File Name               Rows    Total   AVG
                                Time    Mills/
                                Millis  Row     
./20000_true.xls        20000   4655    0.23
./24000_true.xls        24000   6398    0.27
./28800_true.xls        28800   7626    0.26
./34560_true.xls        34560   10178   0.29
./41472_true.xls        41472   17647   0.43
./49766_true.xls        49766   24645   0.5
./59719_true.xls        59719   38765   0.65
./71662_true.xls        71662   63009   0.88
./85994_true.xls        85994   98448   1.14
./103192_true.xls       103192  144313  1.4
./123830_true.xls       123830  217893  1.76
./148596_true.xls       148596  320606  2.16



Thanks for your effort's Yegor.
I will test this in the "wild" with real data soon.


Bryce Alcock * Performance Architect * SunGard * Asset Arena * 377 E.
Butterfield Road Suite 800, Lombard, IL 60148 * Tel 630-986-3006 *
Confidential Fax 630-515-1908 * www.sungard.com/assetarena



-----Original Message-----
From: Yegor Kozlov [mailto:[email protected]]
Sent: Mon 5/24/2010 12:58 AM
To: POI Developers List
Cc: Alcock, Bryce
Subject: Re: Performance Question with CTSheetDataImpl.java

the purpose of calling CTSheetData#setRowArray is to ensure that the
array of CTRow beans in CTSheetData is ordered. The point is that we
don't always need this operation, in many cases rows are already ordered
and re-assigning them to CTSheetData is an unnecessary expensive
operation.

Compare two use cases:

case 1: rows are written in increasing order

XSSFSheet sheet = workbook.createSheet(); sheet.createRow(1);
sheet.createRow(2); sheet.createRow(3); workbook.write(out); //no need
to re-order rows before writing


case 2: row are created in random order

XSSFSheet sheet = workbook.createSheet(); sheet.createRow(3);
sheet.createRow(2); sheet.createRow(1); workbook.write(out); //need to
re-order rows before writing

In the first case the call of CTSheetData#setRowArray is extra because
the state of CTSheetData match the logical model. In the second case we
do need to call CTSheetData#setRowArray to ensure that rows are written
in ascending order.

So, the performance of XSSFSheet#write depends on how rows were added:
if rows are added in strict ascending order then XSSFSheet#write is
fast. If the order of rows is random or rows were shifted then
XSSFSheet#write involves CTSheetData#setRowArray and becomes more
expensive.

I committed this improvement in r947542. Please try the latest build
from trunk.

Below are two my tests, the only difference is that in the first test
rows are written in ascending order and in the second test in
descending.

      public static void main(String[] args) throws Exception {
          long t1 = System.currentTimeMillis();

          XSSFWorkbook wb = new XSSFWorkbook();
          XSSFSheet sh = wb.createSheet();
          for (int i = 0; i<  Short.MAX_VALUE; i++) {
              XSSFRow row = sh.createRow(i);
              for (int j = 0; j<  5; j++) {
                  XSSFCell cell = row.createCell(j);
                  cell.setCellValue(i*j);
              }
          }
          long t2 = System.currentTimeMillis();
          System.out.println("generate: " + (t2-t1) + " ms");

          FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
          wb.write(out);
          out.close();
          long t3 = System.currentTimeMillis();
          System.out.println("write: " + (t3-t2) + " ms");
      }


generate: 5488 ms
write: 2507 ms

      public static void main(String[] args) throws Exception {
          long t1 = System.currentTimeMillis();

          XSSFWorkbook wb = new XSSFWorkbook();
          XSSFSheet sh = wb.createSheet();
          for (int i = Short.MAX_VALUE; i>  0; i--) {
              XSSFRow row = sh.createRow(i);
              for (int j = 0; j<  5; j++) {
                  XSSFCell cell = row.createCell(j);
                  cell.setCellValue(i*j);
              }
          }
          long t2 = System.currentTimeMillis();
          System.out.println("generate: " + (t2-t1) + " ms");

          FileOutputStream out = new FileOutputStream(new
File("/temp/rows.xlsx"));
          wb.write(out); //will involve CTSheetData#setRowArray
          out.close();
          long t3 = System.currentTimeMillis();
          System.out.println("write: " + (t3-t2) + " ms");
      }

generate: 9291 ms
write: 94888 ms

The first test is almost 40X faster!

Regards,
Yegor

Continuing to chip away at CTSheetDataImpl's performance issues.
However, the problematic code is unlikely to be changed, I have a

discussion going on the XMLBeans website, and as it turns out that code
is most likely not going to change, or will not change easily.


With that in mind, I started taking a close look at the POI side of

things and very specifically the problematic code from a POI
perspective.


At approximately line 2368 of XSSFSheet.java

the following line is the suspect line:
I have modified the code to break 1 line into 3  the slow line is the

last one.

          CTRow[] lctRow = new CTRow[rArray.size()];
        lctRow = rArray.toArray(lctRow);
          sheetData.setRowArray(lctRow);



My Question is:  in looking at XSSFSheet.java
are there any assumption that we can make about the state of sheetData

(A CTSheetData object)

that would allow us to either overwrite the default

setRowArray(lctRow) method,

or side step this call all together by calling a method that is not

generic, but instead much

more targeted to the function and process that is currently going on.








Here is the full call in context...Take from my modified version of

XSSFSheet.java


      protected void write(OutputStream out) throws IOException {

          if(worksheet.getColsArray().length == 1) {
              CTCols col = worksheet.getColsArray(0);
              CTCol[] cols = col.getColArray();
              if(cols.length == 0) {
                  worksheet.setColsArray(null);
              }
          }

          // Now re-generate our CTHyperlinks, if needed
          if(hyperlinks.size()>   0) {
              if(worksheet.getHyperlinks() == null) {
                  worksheet.addNewHyperlinks();
              }
              CTHyperlink[] ctHls = new CTHyperlink[hyperlinks.size()];
              for(int i=0; i<ctHls.length; i++) {
                  // If our sheet has hyperlinks, have them add
                  //  any relationships that they might need
                  XSSFHyperlink hyperlink = hyperlinks.get(i);
                  hyperlink.generateRelationIfNeeded(getPackagePart());
                  // Now grab their underling object
                  ctHls[i] = hyperlink.getCTHyperlink();
              }
              worksheet.getHyperlinks().setHyperlinkArray(ctHls);
          }

          CTSheetData sheetData = worksheet.getSheetData();
          ArrayList<CTRow>   rArray = new ArrayList<CTRow>(rows.size());
          for(XSSFRow row : rows.values()){
              row.onDocumentWrite();
              rArray.add(row.getCTRow());
          }
                long startTimeSheetData = System.currentTimeMillis();
                CTRow[] lctRow = new CTRow[rArray.size()];
                lctRow = rArray.toArray(lctRow);
          sheetData.setRowArray(lctRow);
                System.out.println("Sheet Time : " +

(System.currentTimeMillis() - startTimeSheetData));


          XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
          xmlOptions.setSaveSyntheticDocumentElement(new

QName(CTWorksheet.type.getName().getNamespaceURI(), "worksheet"));

          Map<String, String>   map = new HashMap<String, String>();
          map.put(STRelationshipId.type.getName().getNamespaceURI(),

"r");

          xmlOptions.setSaveSuggestedPrefixes(map);
                long startTimeWorkSheetSave =

System.currentTimeMillis();

          worksheet.save(out, xmlOptions);
                System.out.println("Worksheet save Time : " +

(System.currentTimeMillis() - startTimeWorkSheetSave));

      }



Best Regards
Bryce Alcock


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]







---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Performance Question with CTSheetDataImpl.java

Reply via email to