WeichenXu123 commented on PR #40724:
URL: https://github.com/apache/spark/pull/40724#issuecomment-1524671277
> @mengxr raised another suggestion: use Petastorm to load data from DBFS /
HDFS / ... (so that TorchDistributor can have a simpler interface). But
there is a shortcoming: it performs poorly for sparse vector features. We
haven't made a final decision yet.
Finally, after offline discussion, we decided to adopt the approach in this
PR, because it has significant benefits:
- It does not need to dump the training dataset to a distributed file system;
it just saves partition data to local disk, which is much faster.
- If we used Petastorm or a PyTorch Parquet / Arrow loader, we would have to
densify the sparse feature input data, which makes the data explode in size
before the dataset is saved and further degrades performance. The approach in
this PR instead dumps the sparse data to local disk and densifies it only
when loading for training.
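The benefit of keeping features sparse on disk and densifying lazily can be sketched as follows. This is a hypothetical illustration using SciPy and PyTorch, not the actual code from this PR; the file path and matrix shapes are made up for the example:

```python
import numpy as np
import torch
from scipy import sparse

# A sparse feature matrix (1% nonzeros), standing in for one partition's data.
features = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

# Persist the partition in sparse form: only the nonzero entries hit local
# disk, so the file is far smaller than the densified equivalent would be.
sparse.save_npz("/tmp/partition_0.npz", features)

# At training time, reload the sparse matrix and densify one batch at a time,
# so the fully dense matrix never has to exist on disk or in memory at once.
loaded = sparse.load_npz("/tmp/partition_0.npz")
batch = torch.from_numpy(loaded[0:32].toarray())  # densify only 32 rows
print(batch.shape)
```

With this pattern, the on-disk footprint scales with the number of nonzeros rather than with rows × columns, which is the performance concern raised about the Petastorm path.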
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]