[ 
https://issues.apache.org/jira/browse/AVRO-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985980#comment-16985980
 ] 

Ryan Skraba commented on AVRO-2636:
-----------------------------------

Thanks for the use case! 

As noted, BYTES is particularly bad because *reading* the ByteBuffer will 
change its state.

It gets a bit worse: the default value can be mutable RECORD, ARRAY, MAP 
values, which will likely behave badly if the user decides to modify these 
values in a multithreaded environment.  Those types can contain nested data of 
type BYTES.

Maybe the right thing to do *IS* to always return a `deepCopy` from this 
method... we might need to do fixes in the Avro internals to make sure we're 
not making unnecessary copies.  Even so, to check whethr we need a ThreadLocal 
cache for reentrant deepCopy calls.

While looking for how this would affect others, I took a quick look at [Parquet 
using 
{{getDefaultValue}}|https://github.com/apache/parquet-mr/blob/62dcc68acaf64012bf731e103be780956f1f446d/parquet-avro/src/main/java/org/apache/parquet/avro/AvroRecordConverter.java#L168-L172].
  It makes multiple calls to test for presence of a default, to get the default 
value, and it appears to throw away the copy if it's not needed.

I'll investigate a bit further, but opinions welcome!

> GenericData defaultValueCache caches mutable ByteBuffers
> --------------------------------------------------------
>
>                 Key: AVRO-2636
>                 URL: https://issues.apache.org/jira/browse/AVRO-2636
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Valentin Nikotin
>            Priority: Minor
>
> It appears that for default value for Byte type (and Decimal logical type if 
> it uses underlying Bytes type) value rendered with getDefaultValue is cached. 
> This leads to bugs when you read the same value (for example if converted 
> with DecimalConversion). For single thread environment workaround would be to 
> reset ByteBuffer after read, but in concurrent environment we should not 
> cache mutable objects.
>  
> {code:java}
> @Test(expected=NumberFormatException.class)
> public void testReuse() {
>     Conversions.DecimalConversion decimalConversion =
>             new Conversions.DecimalConversion();
>     LogicalType logicalDecimal =
>             LogicalTypes.decimal(38, 9);
>     ByteBuffer defaultValue =
>             decimalConversion.toBytes(
>                     BigDecimal.valueOf(42L).setScale(9),
>                     null,
>                     logicalDecimal);
>     Schema schema = SchemaBuilder
>             .record("test")
>             .fields()
>             .name("decimal")
>             
> .type(logicalDecimal.addToSchema(SchemaBuilder.builder().bytesType()))
>             .withDefault(defaultValue)
>             .endRecord();
>     BigDecimal firstRead = decimalConversion
>             .fromBytes(
>                     (ByteBuffer) 
> GenericData.get().getDefaultValue(schema.getField("decimal")),
>                     null,
>                     logicalDecimal);
>     BigDecimal secondRead = decimalConversion
>             .fromBytes(
>                     (ByteBuffer) 
> GenericData.get().getDefaultValue(schema.getField("decimal")),
>                     null,
>                     logicalDecimal);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to