[ https://issues.apache.org/jira/browse/NIFI-4496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292922#comment-16292922 ]

ASF GitHub Bot commented on NIFI-4496:
--------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2245#discussion_r157261143
  
    --- Diff: nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/csv/JacksonCSVRecordReader.java ---
    @@ -136,7 +134,7 @@ public Record nextRecord(final boolean coerceTypes, final boolean dropUnknownFie
     
                 // If the first record is the header names (and we're using them), store those off for use in creating the value map on the next iterations
                 if (rawFieldNames == null) {
    -                if (hasHeader && ignoreHeader) {
    +                if (!hasHeader || ignoreHeader) {
                         rawFieldNames = schema.getFieldNames();
                     } else {
                         rawFieldNames = Arrays.stream(csvRecord).map((a) -> {
    --- End diff --
    
    I'm not sure that I understand the logic here... was this perhaps due to some refactoring and got overlooked, or is it actually doing something that's just not obvious to me? It seems this could just be done as `Arrays.asList(csvRecord)`.
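
For context, a minimal standalone sketch of the two points raised in this review: how the old condition `hasHeader && ignoreHeader` and the patched condition `!hasHeader || ignoreHeader` differ, and the `Arrays.asList(csvRecord)` simplification suggested above. The variable names follow the diff; the contents of `csvRecord` here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class HeaderConditionDemo {
    public static void main(String[] args) {
        // The patch changed (hasHeader && ignoreHeader) to (!hasHeader || ignoreHeader).
        // Enumerating all four cases shows the two conditions differ exactly
        // when hasHeader is false:
        boolean[] vals = {false, true};
        for (boolean hasHeader : vals) {
            for (boolean ignoreHeader : vals) {
                boolean before = hasHeader && ignoreHeader;
                boolean after = !hasHeader || ignoreHeader;
                System.out.printf("hasHeader=%b ignoreHeader=%b before=%b after=%b%n",
                        hasHeader, ignoreHeader, before, after);
            }
        }

        // The review also suggests that streaming csvRecord only to collect it
        // back into a list could be replaced with Arrays.asList:
        String[] csvRecord = {"id", "name", "value"}; // hypothetical header row
        List<String> rawFieldNames = Arrays.asList(csvRecord);
        System.out.println(rawFieldNames);
    }
}
```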


> Improve performance of CSVReader
> --------------------------------
>
>                 Key: NIFI-4496
>                 URL: https://issues.apache.org/jira/browse/NIFI-4496
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
>
> During some throughput testing, it was noted that the CSVReader was not as 
> fast as desired, processing less than 50k records per second. A look at [this 
> benchmark|https://github.com/uniVocity/csv-parsers-comparison] implies that 
> the Apache Commons CSV parser (used by CSVReader) is quite slow compared to 
> others.
> From that benchmark it appears that CSVReader could be enhanced by using a 
> different CSV parser under the hood. Perhaps Jackson is the best choice, as 
> it is fast when values are quoted, and is a mature and maintained codebase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
