josephglanville opened a new issue #10229:
URL: https://github.com/apache/druid/issues/10229


   When attempting to ingest Avro OCF files via the web console from local 
files I encountered 2 problems.
   The first is the sampler API failing with the following error message when 
attempting to ingest the SomeAvroDatum test file used in the InputFormat tests:
   ```
   2020-08-01T10:02:21,468 WARN [qtp1391890442-156] 
org.eclipse.jetty.server.HttpChannel - handleException 
/druid/indexer/v1/sampler 
com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer 
found for class org.apache.avro.generic.GenericData$Fixed and no properties 
discovered to create BeanSerializer (to avoid exception, disable 
SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: 
org.apache.druid.indexing.overlord.sampler.SamplerResponse["data"]->java.util.ArrayList[0]->org.apache.druid.indexing.overlord.sampler.SamplerResponse$SamplerResponseRow["input"]->java.util.HashMap["someFixed"])
   ```
   
   I was able to fix this by disabling said feature on the default object 
mapper (because I couldn't find how the mapper for the sampler is initialised) 
but I don't know if that is a reasonable fix or if something more scoped can be 
done.
   
   Additionally, there are 2 bugs with format detection right now. The first is 
my fault: the byte prefix used to match Avro OCF files is wrong. It attempts to 
match `Obj1` entirely as ASCII character codes, but the fourth byte is the raw 
value 0x01, not the ASCII character `1` (0x31), so the match is incorrect.
   However, this doesn't actually come into play, because for some reason the 
first few bytes of the sample datum are missing by the time the selection logic 
runs.
   The file is correct as can be seen with the hexdump:
   ```
   00000000  4f 62 6a 01 02 16 61 76  72 6f 2e 73 63 68 65 6d  |Obj...avro.schem|
   00000010  61 f4 12 7b 22 74 79 70  65 22 3a 22 72 65 63 6f  |a..{"type":"reco|
   ```
   And it is loaded correctly after fixing the mapper serialisation settings.
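
To make the prefix problem concrete, here is a minimal sketch of a magic-byte check in plain Java. It is not the detection code Druid actually uses; the class name `AvroOcfMagic` and the method `looksLikeAvroOcf` are hypothetical names chosen for illustration. It only shows that the header in the hexdump above matches `O`, `b`, `j`, 0x01, while the ASCII string `Obj1` does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Druid.
final class AvroOcfMagic {
    // Avro object container files start with ASCII 'O', 'b', 'j'
    // followed by the raw byte 0x01 (not the character '1', which is 0x31).
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given leading bytes begin with the OCF magic.
    static boolean looksLikeAvroOcf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // First six bytes copied from the hexdump of the sample file.
        byte[] fromHexdump = {0x4f, 0x62, 0x6a, 0x01, 0x02, 0x16};
        // What the incorrect prefix effectively looks for.
        byte[] asciiObj1 = "Obj1".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeAvroOcf(fromHexdump)); // true
        System.out.println(looksLikeAvroOcf(asciiObj1));   // false
    }
}
```

A check like this only helps if the bytes it is given really are the start of the file, which is the second problem described above.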
   
   In summary:
   - Format detection for Avro OCF is wrong because the 4th byte should be 0x01, 
not 0x31 (ASCII `1`)
   - Even if the above check were correct, it would fail because something is 
causing missing data ahead of format detection
   - Serialisation of the SamplerResponse fails on Avro classes such as 
`GenericData$Fixed` unless `FAIL_ON_EMPTY_BEANS` is disabled.

