alamb opened a new issue, #1717:
URL: https://github.com/apache/arrow-rs/issues/1717

   **Describe the bug**
   (from the mailing list)
   
   Apparently, it is possible to write a program that appears to write a Parquet file in parallel, but that currently produces corrupt Parquet data.
   
   **To Reproduce**
   Description in the email says:
   
   > I was attempting to build a single Parquet from the batches in what I 
thought was a parallel manner using the ArrowWriter.  I tried to "parallelise" 
the following serial code. 
   
   ```rust
   let cursor = InMemoryWriteableCursor::default();
   let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
   for batch in batches {
       writer.write(batch)?;
   }
   writer.close()?;
   ```
   
   > I realised that although the compiler accepted my incorrect parallel version of this code, it was in fact not sound, which caused the corruption.
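
The parallel code from the email is not reproduced here, so the following is only a std-only sketch of one way this kind of corruption could arise. It assumes that each parallel task created its own writer over a clone of the cursor, and that clones of the cursor share a single underlying buffer. `SharedCursor`, `FramedWriter`, `count_magic`, and `write_in_parallel` are hypothetical stand-ins for illustration and are not part of the parquet crate's API.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a cloneable in-memory cursor:
/// every clone appends to the same underlying buffer.
#[derive(Clone, Default)]
struct SharedCursor {
    buffer: Arc<Mutex<Vec<u8>>>,
}

impl Write for SharedCursor {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buffer.lock().unwrap().extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Hypothetical framed writer: emits a magic marker when created and
/// again when closed, loosely mirroring a file format with a header
/// and a footer.
struct FramedWriter<W: Write> {
    sink: W,
}

impl<W: Write> FramedWriter<W> {
    fn try_new(mut sink: W) -> io::Result<Self> {
        sink.write_all(b"PAR1")?;
        Ok(Self { sink })
    }

    fn write(&mut self, payload: &[u8]) -> io::Result<()> {
        self.sink.write_all(payload)
    }

    fn close(mut self) -> io::Result<()> {
        self.sink.write_all(b"PAR1")
    }
}

/// Counts the positions at which `needle` occurs in `haystack`.
fn count_magic(haystack: &[u8], needle: &[u8]) -> usize {
    haystack.windows(needle.len()).filter(|w| *w == needle).count()
}

/// Spawns `n` threads, each with its own writer over a clone of one cursor.
/// The compiler accepts this, yet all writers append to the same buffer.
fn write_in_parallel(n: usize) -> Vec<u8> {
    let cursor = SharedCursor::default();
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let sink = cursor.clone();
            thread::spawn(move || -> io::Result<()> {
                let mut writer = FramedWriter::try_new(sink)?;
                writer.write(format!("batch{}", i).as_bytes())?;
                writer.close()
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap().unwrap();
    }
    let bytes = cursor.buffer.lock().unwrap().clone();
    bytes
}

fn main() {
    let bytes = write_in_parallel(4);
    // A single well-formed file would contain exactly two markers.
    println!(
        "{} bytes, {} magic markers",
        bytes.len(),
        count_magic(&bytes, b"PAR1")
    );
}
```

Under these assumptions the shared buffer ends up holding several interleaved header/payload/footer sequences rather than one well-formed file, even though every individual write is synchronised and the program compiles without warnings about sharing.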
   
   **Expected behavior**
   The API should not allow corrupted data to be written; ideally, incorrect parallel use would be rejected with a compiler error.
   
   Note I will file a separate ticket for actually writing a parquet file in 
parallel. 
   
   **Additional context**
   Mailing list https://lists.apache.org/thread/rbhfwcpd6qfk52rtzm2t6mo3fhvdpc91
   
   Also, https://github.com/apache/arrow-rs/issues/1711 is possibly related

