WeichenXu123 commented on PR #40724:
URL: https://github.com/apache/spark/pull/40724#issuecomment-1524671277
> @mengxr raised another suggestion: use Petastorm to load data from DBFS /
HDFS / ... (so that TorchDistributor can have a simpler interface). But
there is a shortcoming: it performs poorly for sparse vector features. We
haven't made a final decision yet.
Finally, after offline discussion, we decided to adopt the approach in this
PR, because it has significant benefits:
- It does not need to dump the training dataset to a distributed file system;
it just saves partition data to local disk, which is much faster.
- If we used Petastorm or a PyTorch Parquet / Arrow loader, we would have to
densify the sparse feature input data, which makes the data explode in size
before the dataset is saved and further degrades performance. The approach in
this PR instead dumps the sparse data to local disk and densifies it only
when loading for training.
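The benefit of keeping features sparse on disk and densifying lazily can be sketched as follows. This is a hypothetical illustration using SciPy and PyTorch, not the actual code from this PR; the file path and matrix shapes are made up for the example:

```python
import numpy as np
import torch
from scipy import sparse

# A sparse feature matrix (1% nonzeros), standing in for one partition's data.
features = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

# Persist the partition in sparse form: only the nonzero entries hit local
# disk, so the file is far smaller than the densified equivalent would be.
sparse.save_npz("/tmp/partition_0.npz", features)

# At training time, reload the sparse matrix and densify one batch at a time,
# so the fully dense matrix never has to exist on disk or in memory at once.
loaded = sparse.load_npz("/tmp/partition_0.npz")
batch = torch.from_numpy(loaded[0:32].toarray())  # densify only 32 rows
print(batch.shape)
```

With this pattern, the on-disk footprint scales with the number of nonzeros rather than with rows × columns, which is the performance concern raised about the Petastorm path.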
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]