eitsupi opened a new issue #12653:
URL: https://github.com/apache/arrow/issues/12653
Having found the following description in the documentation, I tried scanning a dataset larger than memory and writing it out to another dataset: https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data

> The above examples wrote data from a table. If you are writing a large amount of data you may not be able to load everything into a single in-memory table. Fortunately, the write_dataset() method also accepts an iterable of record batches. This makes it really simple, for example, to repartition a large dataset without loading the entire dataset into memory:

```python
import pyarrow.dataset as ds

input_dataset = ds.dataset("input")
ds.write_dataset(input_dataset.scanner(), "output", format="parquet")
```

```r
arrow::open_dataset("input") |> arrow::write_dataset("output")
```

But both Python and R crashed on Windows due to lack of memory. Am I missing something? Is there a recommended way to convert one dataset to another without running out of memory?
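For reference, the documentation also says write_dataset() accepts an iterable of record batches, so I assume the batch-streaming variant would look roughly like the sketch below (passing the iterator from to_batches() and an explicit schema=, since the batches themselves no longer carry dataset-level metadata). This is only how I understand the quoted docs, not something I have confirmed avoids the memory issue.

```python
import pyarrow.dataset as ds

input_dataset = ds.dataset("input")

# Stream record batches rather than materializing one in-memory table.
# schema= is assumed to be needed when passing raw batches instead of a scanner.
ds.write_dataset(
    input_dataset.to_batches(),
    "output",
    schema=input_dataset.schema,
    format="parquet",
)
```

If this form is expected to behave differently from passing a scanner, it would be good to know which one the docs recommend for datasets larger than memory.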
