Re: [PR] GH-38503: [Go][Parquet] Style improvement for using ArrowColumnWriter [arrow]

via GitHub Mon, 13 Nov 2023 19:50:36 -0800


mapleFU commented on PR #38581:
URL: https://github.com/apache/arrow/pull/38581#issuecomment-1809504173


   > So while I agree that the changes here improve the name of that last 
argument and make it easier to get the leaf index with the newly exported 
arrowColumnWriter.LeafCount() method, I'm not sure this makes for the most 
intuitive API. Since the function is called with an Arrow column chunk and the 
number of Arrow rows, wouldn't it make sense to call it with the Arrow column 
index as well (as done in the tests above)?
   
   This is not the most intuitive API. Generally, the `parquet/arrow` module 
needs to convert an Arrow record batch into Parquet leaf columns, and that 
conversion should be done inside `parquet/arrow`:
   
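   ```
   func (fw *FileWriter) WriteColumnChunked(data *arrow.Chunked, offset, size int64) error {
        // fw.colIdx is the index of the first Parquet leaf for this Arrow column.
        acw, err := NewArrowColumnWriter(data, offset, size, fw.manifest, fw.rgw, fw.colIdx)
        if err != nil {
                return err
        }
        // Advance past every leaf this (possibly nested) column produced.
        fw.colIdx += acw.leafCount
        return acw.Write(fw.ctx)
   }
   ```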
   
   This API is much easier to use, and it is what `WriteTable` uses. Also, C++ 
has a `WriteRecordBatch`, which we use in our in-house Parquet writer. I think 
it would be easy to port and use here.
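
To illustrate the `fw.colIdx += acw.leafCount` bookkeeping, here is a minimal, self-contained sketch. It does not use the real `pqarrow` types: `field`, `leafCount` and `leafStarts` are hypothetical names, and the model assumes that any field without children maps to exactly one Parquet leaf column.

```go
package main

import "fmt"

// field is a hypothetical, simplified stand-in for an Arrow field.
// Assumption: a field with no children maps to exactly one Parquet leaf.
type field struct {
	name     string
	children []field
}

// leafCount returns how many Parquet leaf columns a field expands to.
func leafCount(f field) int {
	if len(f.children) == 0 {
		return 1
	}
	n := 0
	for _, c := range f.children {
		n += leafCount(c)
	}
	return n
}

// leafStarts returns, for each top-level column, the index of its first leaf.
// It is the same running sum that fw.colIdx keeps across calls.
func leafStarts(cols []field) []int {
	starts := make([]int, len(cols))
	next := 0
	for i, c := range cols {
		starts[i] = next
		next += leafCount(c)
	}
	return starts
}

func main() {
	cols := []field{
		{name: "id"},
		{name: "point", children: []field{{name: "x"}, {name: "y"}}},
		{name: "tags", children: []field{{name: "element"}}},
		{name: "score"},
	}
	fmt.Println(leafStarts(cols)) // [0 1 3 4]
}
```

With a nested column such as `point`, the Arrow column index (1) and the first leaf index (1) still agree, but every column after it is shifted: `score` is Arrow column 3 and leaf 4. That shift is what a caller has to track when the writer does not do it for them.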
   
   `ArrowColumnWriter` is necessary for writing Parquet, but I think exporting 
it as a public API is a bit of a hack. Users would always need to maintain the 
offsets, array types, etc. themselves.
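
As one reading of "maintaining the offsets", the sketch below shows the row-range arithmetic a caller takes on when passing `offset` and `size` for a chunked column. It is a simplified model, not the `pqarrow` implementation: `chunkSlice` and `sliceChunks` are hypothetical names, and chunks are represented only by their row counts.

```go
package main

import "fmt"

// chunkSlice names a contiguous run of rows inside one chunk (hypothetical type).
type chunkSlice struct {
	chunk  int   // index of the chunk
	start  int64 // first row within that chunk
	length int64 // number of rows taken from that chunk
}

// sliceChunks maps the row range [offset, offset+size) of a chunked column
// onto per-chunk slices. chunkLens holds the row count of each chunk.
func sliceChunks(chunkLens []int64, offset, size int64) ([]chunkSlice, error) {
	var total int64
	for _, n := range chunkLens {
		total += n
	}
	if offset < 0 || size < 0 || offset+size > total {
		return nil, fmt.Errorf("range [%d, %d) outside column of %d rows",
			offset, offset+size, total)
	}
	var out []chunkSlice
	for i, n := range chunkLens {
		if size == 0 {
			break
		}
		if offset >= n { // the range starts in a later chunk
			offset -= n
			continue
		}
		take := n - offset
		if take > size {
			take = size
		}
		out = append(out, chunkSlice{chunk: i, start: offset, length: take})
		size -= take
		offset = 0 // later chunks are read from their first row
	}
	return out, nil
}

func main() {
	slices, err := sliceChunks([]int64{4, 3, 5}, 2, 7)
	if err != nil {
		panic(err)
	}
	fmt.Println(slices) // [{0 2 2} {1 0 3} {2 0 2}]
}
```

A record-batch level entry point would hide this arithmetic, along with the leaf index, from the caller.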
   


