[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

thunterdb Thu, 26 Oct 2017 03:06:33 -0700

Github user thunterdb commented on the issue:

    https://github.com/apache/spark/pull/19439
  
    @hhbyyh I now recall the reason for the extra `origin` field: it works 
around the common problem of having many small image files in S3 or other 
distributed file systems. It is standard practice to compact many small images 
into larger zip files, and the original `readImages` implementation could 
recursively traverse zip files to handle that. We would like to add this 
feature back at some point.
    
    When you compact multiple images into a single zip file, though, the 
filename seen by the reader is that of the zip file, so an extra `origin` 
field is a convenient way to name each image correctly. This field is optional 
and this format is still experimental, so I do not think it will be an issue 
to deprecate the field if it is deemed to be too much trouble.
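
    To make the naming problem concrete, here is a minimal Python sketch of 
how a per-image `origin` could be derived when images are packed in a zip 
file. It is not the `readImages` implementation: the function name 
`iter_zip_images`, the `<zip path>!/<entry name>` naming scheme, and the 
extension filter are all assumptions made for illustration.

```python
import io
import zipfile
from typing import Iterator, Tuple

# Hypothetical extension filter; the real reader may decide differently.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def iter_zip_images(zip_path: str, data: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (origin, raw_bytes) for each image-like entry of a zip archive.

    zip_path is the path of the archive itself, data is its full content.
    The origin string combines both names so that every image stays
    uniquely identifiable (assumed naming scheme, not a Spark convention).
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            if not info.filename.lower().endswith(IMAGE_EXTENSIONS):
                continue  # skip non-image entries
            origin = f"{zip_path}!/{info.filename}"
            yield origin, archive.read(info)
```

    In PySpark, something like 
`sc.binaryFiles(path).flatMap(lambda kv: iter_zip_images(kv[0], kv[1]))` 
could apply this to each archive, since `binaryFiles` yields `(path, bytes)` 
pairs. Without the entry name in `origin`, every image from the same archive 
would carry the same zip filename.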
    
    For example, here is a relevant issue from Deep Learning Pipelines that is 
representative of typical usage:
    https://github.com/databricks/spark-deep-learning/issues/67
    The current workaround is suboptimal in terms of both performance and user 
experience.


