https://bz.apache.org/bugzilla/show_bug.cgi?id=61832
--- Comment #15 from [email protected] ---
I do not agree with this at all: "Some tasks just take a large amount of RAM, which is cheap these days, including the ability to "rent" time and space on any number of cloud provider platforms [...]". This is the kind of reasoning that leads to bloatware and all-around bad products and bad libraries.

Here is an approach that might work for improving performance well within the OP's requirements; I'd be interested to hear any reasons it would not work (rough sketches of each step follow below):

1. The shared string table uses a lookup that tracks only a hash and an index.
2. Worksheets and the shared string table are output as streams to temporary files (optionally compressed before writing, which often saves time given the speed of CPUs and the sluggishness of write operations).
3. After all pieces from #2 are created, the final xlsx is assembled.

Assuming a 256-bit hash, 1 million rows, 150 columns, and 100% distinct string values (obviously unlikely), step 1 will take a maximum of (256+32)/8 * 150 * 1 million = 5.4 GB of memory. This could be further streamlined by not hashing string values under 32 bytes and/or by using a 128-bit hash, and could be supplemented by handling hash collisions. Of course there are libraries that will take care of this, so it shouldn't be much work to incorporate; something performance-oriented like sqlite3 comes to mind as a possibility.

Step 2 will take minimal fixed memory because it streams the data to disk.

Step 3 can release all memory before starting, then stream all the pieces, in serial, into the final output (and implicitly through a zip streamer), which can either be streamed to the calling process or written to disk, again taking minimal fixed memory.
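As a hedged illustration of step 1 (not POI's actual API), here is a minimal Java sketch of a shared-string lookup that keeps only a 256-bit digest and an index per distinct string. The class and method names are hypothetical; a production version would need real collision handling, and a HashMap adds per-entry overhead beyond the raw 36 bytes in the estimate above.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of step 1: track only (hash, index) per distinct string.
public class HashedSharedStrings {
    private final Map<ByteBuffer, Integer> indexByHash = new HashMap<>();
    private final MessageDigest digest;
    private int nextIndex = 0;

    public HashedSharedStrings() throws NoSuchAlgorithmException {
        this.digest = MessageDigest.getInstance("SHA-256");
    }

    // Returns the shared-string index for this value, assigning a new one if unseen.
    public int indexFor(String value) {
        byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.wrap(hash);
        Integer existing = indexByHash.get(key);
        if (existing != null) {
            return existing; // assumes no collision; a real version must handle collisions
        }
        int index = nextIndex++;
        indexByHash.put(key, index);
        // The string itself would be appended to the sharedStrings.xml stream here
        // (step 2), so it never has to stay resident in memory.
        return index;
    }
}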
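For step 2, a sketch of streaming worksheet XML to a gzip-compressed temporary file as rows are written, so memory stays flat regardless of row count. The emitted XML is heavily simplified and the class is illustrative only; the same pattern would apply to sharedStrings.xml.

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of step 2: rows go straight to a compressed temp file.
public class StreamingSheetWriter implements AutoCloseable {
    private final Path tempFile;
    private final Writer out;
    private int rowNum = 0;

    public StreamingSheetWriter() throws IOException {
        this.tempFile = Files.createTempFile("sheet", ".xml.gz");
        this.out = new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(tempFile)),
                StandardCharsets.UTF_8);
        out.write("<worksheet><sheetData>");
    }

    // Writes one row of shared-string references (indices produced in step 1).
    public void writeRow(int[] sharedStringIndices) throws IOException {
        rowNum++;
        out.write("<row r=\"" + rowNum + "\">");
        for (int idx : sharedStringIndices) {
            out.write("<c t=\"s\"><v>" + idx + "</v></c>");
        }
        out.write("</row>");
    }

    public Path tempFile() {
        return tempFile;
    }

    @Override
    public void close() throws IOException {
        out.write("</sheetData></worksheet>");
        out.close();
    }
}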
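For step 3, a sketch of assembling the final xlsx by streaming each temp piece, in serial, through a ZipOutputStream (Java 9+ for transferTo). The part list is abbreviated; a real OOXML package also needs [Content_Types].xml, _rels, styles, and so on.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch of step 3: stream the gzipped temp pieces into the final zip.
public class XlsxAssembler {
    public static void assemble(Map<String, Path> entryNameToTempFile,
                                OutputStream target) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            for (Map.Entry<String, Path> e : entryNameToTempFile.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                // Decompress the temp piece and recompress it into the zip entry.
                try (InputStream in = new GZIPInputStream(Files.newInputStream(e.getValue()))) {
                    in.transferTo(zip);
                }
                zip.closeEntry();
            }
        }
    }
}

A caller would pass entries such as "xl/worksheets/sheet1.xml" and "xl/sharedStrings.xml" mapped to the temp files from step 2, with the target being either a file stream on disk or the calling process's output stream.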
