Github user thunterdb commented on the issue:
https://github.com/apache/spark/pull/19439
@hhbyyh thank you for bringing up these questions. In response:
> Does the current schema support or plan to support image feature data in
`Float[]` or `Double[]`?
It does, indirectly: this is what the `CV_32FXX` field types are for. You need
to do some low-level casting to convert the byte array into an array of numbers,
but that should not be a limitation for most applications.
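As an illustration of that low-level casting (a sketch, not part of the PR's API), Python's standard `struct` module can reinterpret a raw byte buffer as 32-bit floats, assuming the little-endian layout OpenCV produces on common platforms:

```python
import struct

def bytes_to_floats(data: bytes) -> list:
    """Reinterpret a raw byte buffer as a flat list of 32-bit floats.

    Assumes little-endian encoding; a CV_32F image's `data` field would
    be decoded this way before reshaping by height/width/channels.
    """
    n = len(data) // 4  # 4 bytes per 32-bit float
    return list(struct.unpack("<%df" % n, data))

# Round-trip example: two floats packed into 8 bytes.
raw = struct.pack("<2f", 1.5, -2.25)
print(bytes_to_floats(raw))  # [1.5, -2.25]
```

The same idea applies in Scala via `java.nio.ByteBuffer` with an explicit byte order.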
> Correct me if I'm wrong, but I don't see ocvTypes playing any role in the
code. If the field is for future extension, maybe it's better to keep only the
supported types, not all the OpenCV types.
Indeed. These fields are added so that users know which values are allowed
for them. A Scala-friendly choice would have been sealed traits or
enumerations, but the consensus in Spark has been to use a low-level
representation. Nothing precludes adding a case class with more type-safety
information to represent this dataset in the future.
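For background on where these integer values come from: OpenCV packs a depth constant and a channel count into a single type code via its `CV_MAKETYPE` macro. A small sketch of that encoding (the constants follow OpenCV's convention; this is not Spark code):

```python
# OpenCV depth constants, in their standard order.
CV_8U, CV_8S, CV_16U, CV_16S, CV_32S, CV_32F, CV_64F = range(7)

def cv_maketype(depth: int, channels: int) -> int:
    """Combine a depth constant and a channel count into an OpenCV type code,
    mirroring OpenCV's CV_MAKETYPE macro."""
    return depth + ((channels - 1) << 3)

print(cv_maketype(CV_8U, 3))   # CV_8UC3 == 16
print(cv_maketype(CV_32F, 1))  # CV_32FC1 == 5
```

This is why a full `ocvTypes` map contains many entries even though Spark only reads a few of them.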
> In most scenarios, deep learning applications use rescaled/cropped images
(typically 256, 224 or smaller). Maybe add an extra parameter "smallSideSize"
to the readImages method, which is more convenient for users and lets us
avoid caching the image at its original size (which could be 100 times larger
than the scaled image). This can be done in a follow-up PR.
This is a good point that we have all hit. The issue is that there is no
unique definition of rescaling (what interpolation? crop then scale? scale
then crop?) and each library has made different choices. This is certainly
something that would be good for a future PR.
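To make the ambiguity concrete, here is a sketch of two common target-size computations for a hypothetical `smallSideSize` parameter; libraries differ on which one they apply (and on interpolation), which is why a single definition is hard to standardize:

```python
def scale_small_side(height: int, width: int, small_side: int):
    """Resize so the smaller side equals small_side, preserving aspect ratio."""
    if height <= width:
        return small_side, round(width * small_side / height)
    return round(height * small_side / width), small_side

def scale_then_center_crop(height: int, width: int, size: int):
    """Scale the small side to `size`, then crop to a size x size square."""
    h, w = scale_small_side(height, width, size)
    return min(h, size), min(w, size)

# A 480x640 image, target small side 224: the two strategies disagree.
print(scale_small_side(480, 640, 224))       # (224, 299)
print(scale_then_center_crop(480, 640, 224)) # (224, 224)
```

Either result is "the 224 version" of the image depending on the library, so the parameter name alone does not pin down the semantics.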
> Not sure about the reason to include "origin" info in the image data.
Based on my experience, path info serves better as a separate column in the
DataFrame (e.g. for prediction).
Yes, this feature has been debated. Some developers have had a compelling
need to access information about the origin of an image directly inside the
image structure.
> IMO the parameter "recursive" may not be necessary. Existing wildcard
matching can provide more functionality.
Indeed, this feature may not seem that useful at first glance. On some Hadoop
file systems, though, images are accessed in a batched manner, and it is
useful to traverse these batches for performance reasons. This is why the
parameter is marked as experimental for the time being.
> For the Scala API, ImageSchema should be in a separate file, not mixed
with image reading.
I do not have a strong opinion on this point; I will let other developers
decide.