https://bz.apache.org/bugzilla/show_bug.cgi?id=61832

--- Comment #15 from [email protected] ---
I do not agree with this at all: "Some tasks just take a large amount of RAM,
which is cheap these days, including the ability to "rent" time and space on
any number of cloud provider platforms [...]".

This is the kind of reasoning that leads to bloatware and all-around bad
products and bad libraries. Here is an approach that should keep memory use
well within the OP's requirements while still performing well. I'd be
interested to hear any reasons it would not work.

1. the shared string table uses a lookup that tracks only a hash and an index
per distinct string (sketched below)
2. worksheets and the shared string table are output as streams to temporary
files (optionally compressed before writing, which often saves time given how
fast CPUs are relative to disk writes)
3. after all the pieces from #2 are created, the final xlsx is assembled by
streaming them together

Assuming a 256-bit hash, 1 million rows, 150 columns, and 100% distinct string
values (obviously unlikely), step 1 will take a maximum of (256 + 32) / 8 bytes
per entry (a 256-bit hash plus a 32-bit index) * 150 * 1,000,000 = 5.4 GB of
memory. This could be trimmed further by not hashing string values under 32
bytes (such a value is no bigger than its hash, so it can be stored directly)
and/or by using a 128-bit hash, and correctness could be guaranteed by
detecting and handling hash collisions. Of course there are libraries that will
take care of this, so it shouldn't be much work to incorporate; something
performance-oriented like sqlite3 comes to mind as a possibility.
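
Even without pulling in a library, a plain in-memory map illustrates step 1.
Here is a minimal sketch (the class and method names are made up, not an
existing POI API). With a 256-bit hash, accidental collisions are effectively
impossible, and a production version would pack the raw hash bytes plus a
32-bit index (roughly the 36 bytes per entry estimated above) rather than the
boxed objects used here for brevity:

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup: keeps only (hash -> index) in memory and streams each
// new distinct string straight to a temp-file writer for the sharedStrings
// part, so the strings themselves never accumulate in memory.
public class HashedSharedStrings {
    private final Map<String, Integer> hashToIndex = new HashMap<>();
    private final Writer sharedStringsOut; // temp-file writer for <si> entries
    private int nextIndex = 0;

    public HashedSharedStrings(Writer sharedStringsOut) {
        this.sharedStringsOut = sharedStringsOut;
    }

    // Returns the shared-string index for the value, streaming it out if new.
    public int indexOf(String value) throws IOException {
        String key = sha256Hex(value);
        Integer existing = hashToIndex.get(key);
        if (existing != null) {
            return existing;
        }
        int index = nextIndex++;
        hashToIndex.put(key, index);
        sharedStringsOut.write("<si><t>" + escapeXml(value) + "</t></si>");
        return index;
    }

    private static String sha256Hex(String value) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}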

Step 2 will take minimal fixed memory because it is streaming the data to disk.
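
For example (again just a sketch, with a made-up helper class), the worksheet
writer from step 2 can be wrapped in a gzip stream on its way to a temp file;
POI's SXSSF already does essentially this for sheet data when
SXSSFWorkbook.setCompressTempFiles(true) is used:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: sheet XML is written through a gzip stream into a
// temp file, so only the stream buffers are held in memory while writing.
public final class TempPartWriter {
    public static Writer openCompressed(File tempFile) throws IOException {
        return new OutputStreamWriter(
                new GZIPOutputStream(
                        new BufferedOutputStream(new FileOutputStream(tempFile))),
                StandardCharsets.UTF_8);
    }
}

The same helper works for the shared strings part written by the lookup above.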

Step 3 can release all of the memory from step 1 before it starts, and then
stream the pieces, one after another, through a zip streamer into the final
output, which can go either to the calling process or to disk, again taking
only a small, fixed amount of memory.
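
Here is a sketch of step 3, assuming each part from step 2 sits in a
gzip-compressed temp file keyed by its eventual zip entry name (the remaining
small parts such as [Content_Types].xml, the relationship files and
workbook.xml would just be further entries in the same map):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical assembler: copies each temp part into the final .xlsx zip.
// Only a small copy buffer is held in memory at any point.
public final class XlsxAssembler {
    public static void assemble(Map<String, File> parts, OutputStream out)
            throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Map.Entry<String, File> part : parts.entrySet()) {
                zip.putNextEntry(new ZipEntry(part.getKey()));
                try (InputStream in =
                        new GZIPInputStream(new FileInputStream(part.getValue()))) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        zip.write(buffer, 0, n);
                    }
                }
                zip.closeEntry();
            }
        }
    }
}

The OutputStream can be a file or, say, a servlet response stream, so the
result never has to be buffered in memory either.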
