[jira] [Commented] (NIFI-912) Support extracting metadata from Avro file headers

Sean Busbey (JIRA) Wed, 09 Sep 2015 06:14:03 -0700

    [ https://issues.apache.org/jira/browse/NIFI-912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736813#comment-14736813 ]


Sean Busbey commented on NIFI-912:
----------------------------------


{code}
+        @WritesAttribute(attribute = "schema.fingerprint", description = "The fingerprint of the schema as determined by the Fingerprint Algorithm.")
{code}

Should the description note that the fingerprint will be a hex string?
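
For illustration, a minimal sketch of what a hex-string fingerprint value could look like. `FingerprintHex`, `toHex`, and `fingerprint` are hypothetical names, not part of the patch. Digesting the raw schema text with `MessageDigest` is an assumption made to keep the sketch self-contained; Avro's `SchemaNormalization` fingerprints the schema's Parsing Canonical Form instead, so the actual attribute values would differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of the patch.
final class FingerprintHex {

    // Renders each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Assumption: digests the raw schema text, not Avro's canonical form.
    static String fingerprint(String algorithm, String schemaText)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        return toHex(md.digest(schemaText.getBytes(StandardCharsets.UTF_8)));
    }
}
```

With SHA-256 this yields a 64-character string, which is the sort of detail the attribute description could spell out.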


{code}
+    static final PropertyDescriptor COUNT_RECORDS = new PropertyDescriptor.Builder()
+            .name("Count Records")
+            .description("If true the number of records in the datafile will be counted and stored in a FlowFile attribute 'record.count'.")
+            .addValidator(StandardValidators.BOOLEAN_VALIDATOR)
+            .allowableValues("true", "false")
+            .defaultValue("false")
+            .required(true)
+            .build();
+
{code}

nit: is it worth noting in the description that we do this by looking at 
metadata within the datafile format, and not by e.g. deserializing the records?
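
To make the distinction concrete, here is a rough, stdlib-only sketch of counting from the container format alone. It relies on the Avro object container layout (magic bytes, a metadata map, a 16-byte sync marker, then blocks that each start with an object count and a byte size). `AvroBlockCounter` and its methods are hypothetical names, not the patch's implementation; the Avro Java library exposes the same per-block count through `DataFileStream`, which is presumably what a processor would use.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the patch or of Avro.
final class AvroBlockCounter {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};
    private static final int SYNC_SIZE = 16;

    // Sums the per-block object counts without decoding any datum.
    static long countRecords(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[MAGIC.length];
        in.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not an Avro data file");
        }
        skipMetadata(in);
        skipFully(in, SYNC_SIZE);
        long total = 0;
        int first;
        while ((first = in.read()) >= 0) {           // EOF between blocks ends the file
            long count = readLong(in, first);        // objects in this block
            long size = readLong(in, in.read());     // serialized size in bytes
            if (count < 0) throw new IOException("Negative block count");
            skipFully(in, size);                     // skip the (possibly compressed) data
            skipFully(in, SYNC_SIZE);
            total += count;
        }
        return total;
    }

    // Skips the header's map<string, bytes>, which is written as counted blocks.
    private static void skipMetadata(DataInputStream in) throws IOException {
        while (true) {
            long count = readLong(in, in.read());
            if (count == 0) return;                  // a zero count terminates the map
            if (count < 0) {                         // negative count: a byte size follows
                skipFully(in, readLong(in, in.read()));
                continue;
            }
            for (long i = 0; i < count; i++) {
                skipFully(in, readLong(in, in.read())); // key
                skipFully(in, readLong(in, in.read())); // value
            }
        }
    }

    // Decodes one zigzag varint long whose first byte has already been read.
    private static long readLong(InputStream in, int first) throws IOException {
        long value = 0;
        int shift = 0;
        int b = first;
        while (true) {
            if (b < 0) throw new EOFException();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 63) throw new IOException("Varint too long");
            b = in.read();
        }
        return (value >>> 1) ^ -(value & 1);
    }

    private static void skipFully(DataInputStream in, long n) throws IOException {
        if (n < 0) throw new IOException("Negative length");
        while (n > 0) {
            int skipped = in.skipBytes((int) Math.min(n, Integer.MAX_VALUE));
            if (skipped <= 0) {                      // skipBytes may return 0; force progress
                in.readByte();                       // throws EOFException at end of stream
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```

Because only block headers are read, the cost depends on the number of blocks rather than the number of datums, and it works the same way whether or not the schema is a record.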

{code}
+    @Test
+    public void testExtractionWithNonRecordSchema() throws IOException {
+        final TestRunner runner = TestRunners.newTestRunner(new ExtractAvroMetadata());
+        final Schema schema = new Schema.Parser().parse(new File("src/test/resources/array.avsc"));
+
+        final GenericData.Array<String> data = new GenericData.Array<>(schema, Arrays.asList("one", "two", "three"));
+        final DatumWriter<GenericData.Array<String>> datumWriter = new GenericDatumWriter<>(schema);
+
+        final ByteArrayOutputStream out = new ByteArrayOutputStream();
+        final DataFileWriter<GenericData.Array<String>> dataFileWriter = new DataFileWriter<>(datumWriter);
+        dataFileWriter.create(schema, out);
+        dataFileWriter.append(data);
+        dataFileWriter.close();
+
+        runner.enqueue(out.toByteArray());
+        runner.run();
+
+        runner.assertAllFlowFilesTransferred(ConvertAvroToJSON.REL_SUCCESS, 1);
+
+        final MockFlowFile flowFile = runner.getFlowFilesForRelationship(ExtractAvroMetadata.REL_SUCCESS).get(0);
+        flowFile.assertAttributeExists(ExtractAvroMetadata.SCHEMA_FINGERPRINT_ATTR);
+        flowFile.assertAttributeEquals(ExtractAvroMetadata.SCHEMA_TYPE_ATTR, Schema.Type.ARRAY.getName());
+        flowFile.assertAttributeEquals(ExtractAvroMetadata.SCHEMA_NAME_ATTR, "array");
+    }
{code}

Maybe "record count" was a bad choice of name on my part? We should be able to 
get the count of data items in this flow too, right?

> Support extracting metadata from Avro file headers
> --------------------------------------------------
>
>                 Key: NIFI-912
>                 URL: https://issues.apache.org/jira/browse/NIFI-912
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Bryan Bende
>            Assignee: Bryan Bende
>            Priority: Minor
>             Fix For: 0.4.0
>
>         Attachments: NIFI-912-2.patch, NIFI-912.patch
>
>
> Extract metadata from Avro file headers to FlowFile attributes so that 
> downstream processors can make decisions, such as merging together records of 
> compatible schemas (i.e. the correlation attribute).
> Information to extract:
> - Schema definition (full, not fp)
> - Schema fingerprint
> - Schema root record name (if schema is a record)
> - Key/value metadata, like compression codec



