alamb commented on code in PR #9648:
URL: https://github.com/apache/arrow-datafusion/pull/9648#discussion_r1530533085
##########
datafusion/core/src/datasource/file_format/write/mod.rs:
##########
@@ -150,19 +74,13 @@ pub trait BatchSerializer: Sync + Send {
fn serialize(&self, batch: RecordBatch, initial: bool) -> Result<Bytes>;
}
-/// Returns an [`AbortableWrite`] which writes to the given object store location
+/// Returns an [`AsyncWrite`] which writes to the given object store location
/// with the specified compression
pub(crate) async fn create_writer(
file_compression_type: FileCompressionType,
location: &Path,
object_store: Arc<dyn ObjectStore>,
-) -> Result<AbortableWrite<Box<dyn AsyncWrite + Send + Unpin>>> {
- let (multipart_id, writer) = object_store
Review Comment:
For anyone else following along, the `BufWriter` internally does a
multi-part put when appropriate
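As a concrete illustration (my own sketch, not code from this PR; the file path and payload are made up), here is roughly how the new write path looks with `object_store::buffered::BufWriter`:
```rust
use std::sync::Arc;

use object_store::{buffered::BufWriter, path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical helper: write a small CSV payload to the given store.
async fn write_example(store: Arc<dyn ObjectStore>) -> std::io::Result<()> {
    let location = Path::from("data/output.csv");
    // BufWriter buffers writes in memory and transparently switches to a
    // multi-part upload once enough data has accumulated.
    let mut writer = BufWriter::new(store, location);
    writer.write_all(b"a,b,c\n1,2,3\n").await?;
    // shutdown() flushes and completes the upload (finishing the
    // multi-part put if one was started).
    writer.shutdown().await?;
    Ok(())
}
```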
##########
datafusion/core/src/datasource/file_format/write/mod.rs:
##########
@@ -69,79 +66,6 @@ impl Write for SharedBuffer {
}
}
-/// Stores data needed during abortion of MultiPart writers
Review Comment:
Is it correct to say that the implication of removing `AbortableWrite` is that if certain (larger) writes to the object store fail / abort for some reason, "garbage" (unreferenced partial uploads) may be left around indefinitely on the provider?
While I understand that some object stores (maybe all) can be configured to automatically clean up such parts, I think removing the "try to clean up on failure" behavior is worth reconsidering.
I think I could be convinced by an argument like "the software can't guarantee cleanup anyway (for example, if it is `SIGKILL`ed), so we shouldn't even try to clean up in the paths where we could", but if we go that route I think we should explicitly document the behavior and rationale in comments somewhere.
I think @metesynnada or @mustafasrepo originally added this code (though I may be wrong), so perhaps they have some perspective to share.
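For context, here is a rough sketch of the cleanup pattern being removed, based on the older `object_store` multi-part API (`put_multipart` / `abort_multipart`); it is a simplification of what `AbortableWrite` did, and the function name is mine:
```rust
use std::sync::Arc;

use object_store::{path::Path, ObjectStore};
use tokio::io::AsyncWriteExt;

// Hypothetical helper illustrating the pattern: write via a multi-part
// upload, and abort the upload on failure so no orphaned parts remain.
async fn multipart_put_with_cleanup(
    store: Arc<dyn ObjectStore>,
    location: &Path,
    data: &[u8],
) -> object_store::Result<()> {
    // Old object_store API: returns the upload id plus an AsyncWrite.
    let (multipart_id, mut writer) = store.put_multipart(location).await?;

    let write_result = async {
        writer.write_all(data).await?;
        writer.shutdown().await
    }
    .await;

    if let Err(e) = write_result {
        // Best-effort cleanup: ask the store to discard the parts uploaded
        // so far. This is essentially what `AbortableWrite` did on abort.
        let _ = store.abort_multipart(location, &multipart_id).await;
        return Err(object_store::Error::Generic {
            store: "multipart_put_with_cleanup",
            source: Box::new(e),
        });
    }
    Ok(())
}
```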
##########
datafusion/core/src/datasource/file_format/file_compression_type.rs:
##########
@@ -152,7 +153,7 @@ impl FileCompressionType {
/// according to this `FileCompressionType`.
pub fn convert_async_writer(
&self,
- w: Box<dyn AsyncWrite + Send + Unpin>,
+ w: BufWriter,
Review Comment:
I think it depends on whether this code is always used with `object_store` (i.e., whether DataFusion always writes output through the `object_store` API). If that is the case, then switching to `BufWriter` here makes sense to me.
BTW, I think we also need to update the comments on this function to match the new implementation.
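For example, I'd expect the updated function to look roughly like the sketch below (encoder types are from the `async_compression` crate; the local `CompressionTypeVariant` enum and the exact set of variants are illustrative, and the real function returns a `Result`):
```rust
use async_compression::tokio::write::{BzEncoder, GzipEncoder, XzEncoder, ZstdEncoder};
use object_store::buffered::BufWriter;
use tokio::io::AsyncWrite;

// Illustrative stand-in for DataFusion's compression variant enum.
#[allow(non_camel_case_types)]
enum CompressionTypeVariant {
    GZIP,
    BZIP2,
    XZ,
    ZSTD,
    UNCOMPRESSED,
}

/// Wrap the `BufWriter` in the compression encoder matching `variant`,
/// returning a boxed `AsyncWrite` either way.
fn convert_async_writer(
    variant: &CompressionTypeVariant,
    w: BufWriter,
) -> Box<dyn AsyncWrite + Send + Unpin> {
    match variant {
        CompressionTypeVariant::GZIP => Box::new(GzipEncoder::new(w)),
        CompressionTypeVariant::BZIP2 => Box::new(BzEncoder::new(w)),
        CompressionTypeVariant::XZ => Box::new(XzEncoder::new(w)),
        CompressionTypeVariant::ZSTD => Box::new(ZstdEncoder::new(w)),
        CompressionTypeVariant::UNCOMPRESSED => Box::new(w),
    }
}
```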
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]