GitHub user thunterdb commented on the issue:
https://github.com/apache/spark/pull/19439
@hhbyyh I now recall the reason for the extra `origin` field: it works
around the common problem of storing many small image files in S3 or other
distributed file systems. A usual remedy is to compact the small images into
larger zip files, and the original `readImages` implementation could
recursively traverse zip files to handle that case. This is a feature that we
would like to add back at some point.
When you compact multiple images into a single zip file, though, the filename
is that of the zip file, so an extra `origin` field is a convenient way to
name each image correctly. This field is optional and the format is still
experimental, so I do not think it will be a problem to deprecate the
field if it is deemed to be too much trouble.
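To make the naming problem concrete, here is a minimal sketch in plain Python (standard library only, no Spark). It is not the `readImages` implementation: `ImageRecord`, `iter_images`, the `!/` separator, and the `s3://bucket/images.zip` path are all hypothetical and chosen for illustration. It only shows how a per-image `origin` could be derived when the file path alone would point at the archive.

```python
import io
import zipfile
from typing import Iterator, NamedTuple


class ImageRecord(NamedTuple):
    # Hypothetical record layout, not the Spark image schema.
    origin: str  # assumed convention: "<archive path>!/<entry name>"
    data: bytes  # raw, undecoded image bytes


IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def iter_images(archive_path: str, archive_bytes: bytes) -> Iterator[ImageRecord]:
    """Yield one record per image entry found in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            # Skip directories and entries that do not look like images.
            if info.is_dir() or not info.filename.lower().endswith(IMAGE_EXTENSIONS):
                continue
            # Without `origin`, every image would only carry `archive_path`.
            yield ImageRecord(f"{archive_path}!/{info.filename}", zf.read(info))


def make_demo_archive() -> bytes:
    """Build a small in-memory zip with placeholder (non-decodable) contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("cats/001.png", b"fake-png-bytes")
        zf.writestr("dogs/002.jpg", b"fake-jpg-bytes")
        zf.writestr("README.txt", b"not an image")
    return buf.getvalue()


records = list(iter_images("s3://bucket/images.zip", make_demo_archive()))
for record in records:
    print(record.origin, len(record.data))
```

This sketch does not recurse into nested archives; the recursive traversal mentioned above would need an extra step for entries that are themselves zip files.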
For example, here is a relevant issue that we have in Deep Learning
Pipelines, which is representative of typical scenarios:
https://github.com/databricks/spark-deep-learning/issues/67
The current workaround is suboptimal in terms of performance and user
experience.