thadguidry opened a new issue, #2235: URL: https://github.com/apache/hop/issues/2235
### Apache Hop version?

2.3.0

### Java version?

17.0.4.1

### Operating system

Windows

### What happened?

Wire up a simple workflow of Start -> unzip -> Success.

The .zip file is read from disk, and the file contents are written correctly to a target folder on the same disk. The disk is a RAID 1 array of 7200 RPM HDDs. The single large 1.2 GB .zip file has taken over 4 hours to unzip so far, and I am still waiting. By contrast, on a different machine with a similar HDD configuration, 64-bit 7-Zip completed the extraction in about 40 minutes.

Looking at the code, I wonder whether the option `if file exists` = SKIP is a likely suspect. My strategy is to go as fast as possible: I don't care about checking whether things exist or not across more than 155,000 files, I just want to blast through and overwrite files no matter what. I.e., maybe I should have chosen `if file exists` = OVERWRITE? I don't know which settings would make zip file unpacking or the HopVfs streaming faster. Maybe there should be a setting in the unzip dialog that lets me use more memory or cores?

My other suspicion is the buffering and the low amount of memory used for unpacking. Looking at Task Manager on Windows, I can see that Hop stayed at 50% CPU utilization across all 8 cores of my i7-9700K, with memory usage at 2.5 GB.

Can we do better? Of course. A general strategy, perhaps:

1. The question is how the Hop VFS architecture can unzip faster and optionally use more of the hardware.
2. For now, perhaps warn users about extracting a .zip file that contains many files, with a note to consider other unzip tooling when dealing with large, deep zip files that need to be extracted very fast.
3. The final goal is to eventually make Hop's own unzip faster.

### Issue Priority

Priority: 3

### Issue Component

Component: VFS
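To make the two `if file exists` strategies discussed above concrete, here is a minimal sketch in plain `java.util.zip`. It is not Hop's implementation and does not go through HopVfs; the class name `UnzipSketch`, the `extract` method, and the 1 MiB buffer size are hypothetical choices made only for illustration. Whether the existence check or the buffer size is the real bottleneck in Hop would still need to be measured.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class UnzipSketch {
  // Assumed buffer size for illustration; not a value taken from Hop.
  private static final int BUFFER_SIZE = 1 << 20;

  /**
   * Extracts zipFile into targetDir and returns the number of files written.
   * overwrite = true  : write every entry, replacing existing files (no existence check).
   * overwrite = false : skip entries whose target file already exists.
   */
  public static int extract(Path zipFile, Path targetDir, boolean overwrite) throws IOException {
    Path root = targetDir.toAbsolutePath().normalize();
    byte[] buffer = new byte[BUFFER_SIZE]; // one buffer reused for all entries
    int written = 0;
    try (ZipInputStream zin = new ZipInputStream(
        new BufferedInputStream(Files.newInputStream(zipFile), BUFFER_SIZE))) {
      ZipEntry entry;
      while ((entry = zin.getNextEntry()) != null) {
        Path out = root.resolve(entry.getName()).normalize();
        // Reject entries that would be written outside the target directory.
        if (!out.startsWith(root)) {
          throw new IOException("Entry escapes target directory: " + entry.getName());
        }
        if (entry.isDirectory()) {
          Files.createDirectories(out);
          continue;
        }
        // SKIP mode costs one extra file-system lookup per entry.
        if (!overwrite && Files.exists(out)) {
          continue;
        }
        Files.createDirectories(out.getParent());
        // Files.newOutputStream creates the file or truncates an existing one.
        try (OutputStream os = Files.newOutputStream(out)) {
          int n;
          while ((n = zin.read(buffer)) != -1) {
            os.write(buffer, 0, n);
          }
        }
        written++;
      }
    }
    return written;
  }
}
```

Calling `UnzipSketch.extract(zip, target, true)` corresponds to the "blast through and overwrite" strategy, while passing `false` adds the per-file existence check that SKIP implies.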
