JingsongLi commented on PR #276:
URL: https://github.com/apache/paimon-rust/pull/276#issuecomment-4301567204
@luoyuxia Thanks for the contribution!
Left comments from Claude 3.7:
1. from_avro_bytes collects all records into Value::Array then
deserializes the whole array (objects_file.rs, manifest.rs, index_manifest.rs).
This creates an intermediate Vec<Value> plus a wrapping Value::Array, then
from_value walks it again. For large files this doubles memory. Consider
deserializing each record individually:
reader
.map(|r| r.map_err(Error::from).and_then(|v|
from_value(&v).map_err(Error::from)))
.collect::<Result<Vec<T>>>()
1. This avoids the intermediate collection entirely.
2. get_field_at formatting — the match arm body has inconsistent brace
style that may trip rustfmt:
Value::Record(fields) => if let Some(i) = idx {
fields.get(i).map(|(_, v)| v)
} else {
fields.iter().find(|(n, _)| n == name).map(|(_, v)| v)
}
.and_then(unwrap_value),
2. The .and_then(unwrap_value) is chained on the if let expression, which
is valid but hard to read. Wrapping the if let in braces or extracting to a
local would be clearer.
3. field_index returns None on empty slices — this is fine, but
get_field_at with idx = None falls back to linear search. If the field
genuinely doesn't exist in the schema, every row will do a linear scan and find
nothing. Not a bug, but worth noting the fallback is O(n·m) in the degenerate
case (n rows, m fields).
4. build_map_column semantic change — the old code used get_field_raw (no
union unwrapping) for maps, because AvroValue::Object was ambiguous between
unions and maps. The new code uses get_field_at which calls unwrap_value and
thus unwraps unions. With apache_avro::types::Value, Map and Union are distinct
variants, so this is actually correct now. Just
flagging it as a subtle behavioral change that's worth a test if one
doesn't exist.
5. DataUnexpected error variant (error.rs) — the message field is always
set to "" in the From<apache_avro::Error> impl. Either drop the message field
from this variant (use only source), or populate it with something useful like
"avro error". An empty string in display("Paimon hitting unexpected avro error
{}: {:?}", message, source) looks odd in logs.
6. apache-avro 0.21 — HashMap iteration order in Value::Map — Avro maps
use HashMap internally, so iteration order in build_map_column is
non-deterministic. The old code also used HashMap, so this isn't a regression,
but it's worth noting that map key/value ordering in the Arrow MapArray output
is not guaranteed to be stable.
7. Minor: index_manifest.rs test — the roundtrip test now manually unwraps
Value::Union(_, inner). This is fine but somewhat fragile if the schema
changes. Consider using from_value directly on the decoded value since
from_value should handle union unwrapping.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]