zeroshade commented on issue #727: URL: https://github.com/apache/arrow-go/issues/727#issuecomment-4128576847
Looking through the code, one thing I've noticed is that we don't actually validate that the schema of the record batch matches the schema of the parquet file being written. If the schemas don't line up, or the record batch has more columns than expected, that could produce the error you're encountering: we attempt to load the encoder for a given int64 or string/binary column in the record batch, but there's no corresponding leaf column in the parquet schema, so the encoder ends up being `nil` instead of the expected type.

Can you try dumping the actual schema of the record batch being written at the point you encounter this error and validating it against the schema that was used when you started the file? Make sure they are *exactly* identical.

As for the error you're encountering on the failure of `WriteBuffered`: that would be because writing values to one column of the row group failed after values had already been written to the columns before it. Parquet requires ALL column chunks in a row group to have the same number of rows, so if a chunk of rows failed to be written to a column because of the failed `WriteBuffered`, the check performed when closing the file would trigger the error you're seeing, i.e. that column has fewer rows than the metadata and the other columns indicate.

This is probably a case where we can improve the error output to better report what is going on to the user.
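To illustrate the kind of strict pre-write check described above, here is a minimal, self-contained sketch. The `Field` type and `schemasEqual` helper are hypothetical stand-ins loosely modeled on arrow-go's `arrow.Field`, not the library's actual code; in practice you can compare `rec.Schema()` against the schema the writer was created with (arrow-go's `arrow.Schema` exposes an `Equal` method for this).

```go
package main

import "fmt"

// Field is a simplified stand-in for arrow.Field: name, type, nullability.
type Field struct {
	Name     string
	Type     string
	Nullable bool
}

// schemasEqual is a hypothetical strict check: same column count, and every
// field identical in name, type, and nullability. This mirrors the
// "exactly identical" requirement described above.
func schemasEqual(fileSchema, batchSchema []Field) error {
	if len(batchSchema) != len(fileSchema) {
		return fmt.Errorf("column count mismatch: file has %d, batch has %d",
			len(fileSchema), len(batchSchema))
	}
	for i, f := range fileSchema {
		if f != batchSchema[i] {
			return fmt.Errorf("column %d mismatch: file=%+v batch=%+v",
				i, f, batchSchema[i])
		}
	}
	return nil
}

func main() {
	file := []Field{{"id", "int64", false}, {"name", "utf8", true}}
	// Batch with an extra column the parquet schema doesn't have: the case
	// where the encoder lookup would come back nil.
	batch := []Field{{"id", "int64", false}, {"name", "utf8", true}, {"extra", "int64", true}}

	if err := schemasEqual(file, batch); err != nil {
		fmt.Println("refusing to write:", err)
	}
}
```

Running a check like this before each write turns the confusing nil-encoder failure into an immediate, descriptive error.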
