eitsupi opened a new issue #12653:
URL: https://github.com/apache/arrow/issues/12653


   Having found the following description in the documentation, I tried scanning 
a dataset larger than memory and writing it out as another dataset.
   
   
https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data
   
   > The above examples wrote data from a table. If you are writing a large 
amount of data you may not be able to load everything into a single in-memory 
table. Fortunately, the write_dataset() method also accepts an iterable of 
record batches. This makes it really simple, for example, to repartition a 
large dataset without loading the entire dataset into memory:
   
   ```python
   import pyarrow.dataset as ds
   
   input_dataset = ds.dataset("input")
   ds.write_dataset(input_dataset.scanner(), "output", format="parquet")
   ```
   
   ```r
   arrow::open_dataset("input") |>
     arrow::write_dataset("output")
   ```
   
   However, both the Python and R versions crashed on Windows due to lack of memory. 
Am I missing something?
   Is there a recommended way to convert one dataset to another without running 
out of memory?
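   
   For reference, this is a minimal sketch of how I understand the "iterable of 
record batches" form mentioned in that paragraph, using Dataset.to_batches() and 
an explicit schema= so the batches can be streamed instead of materialized as a 
table. I have not been able to confirm whether this actually avoids the memory 
issue:
   
   ```python
   import pyarrow.dataset as ds
   
   input_dataset = ds.dataset("input")
   
   # Pass an iterator of record batches plus an explicit schema so that
   # write_dataset() can consume the data batch by batch.
   ds.write_dataset(
       input_dataset.to_batches(),
       "output",
       schema=input_dataset.schema,
       format="parquet",
   )
   ```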

