[ 
https://issues.apache.org/jira/browse/NIFI-7946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jasper Knulst updated NIFI-7946:
--------------------------------
    Description: 
It would be really great to have an additional property for the CSVReader 
controller service to treat multiple consecutive delimiter occurrences as only 
one. This is something you can do in Excel, for instance.

There are many CSV-like formats that have multiple delimiters following each 
other, for instance to aid in aligning columns (set to 'text-mode' first to see 
the diff):

Device          rReq_PS      wReq_PS        rKB_PS        wKB_PS  avgWaitMillis   avgSvcMillis   bandwUtilPct
md1                 0.0          0.0           0.0           0.0            0.0            0.0              0
md10                0.0          8.0           0.0          75.8            0.0            7.5              5
md11                0.0          0.0           0.0           0.0            0.0            0.0              0
md20                0.0          8.0           0.0          75.8            0.0            6.7              5
md21                0.0          0.0           0.0           0.0            0.0            0.0              0
md30                0.0          0.0           0.0           0.0            0.0            0.0              0
md100               0.0          8.0           0.0          75.8            0.0            8.1              6
sd0                 0.0          0.0           0.0           0.0            0.0            0.0              0
sd1                 0.0          8.0           0.0          75.8            0.0            6.6              5
sd2                 0.0          0.0           0.0           0.0            0.0            0.0              0
sd3                 0.0          8.0           0.0          75.8            0.0            7.4              5

Running the CSVReader with " " as delimiter on the above leads to havoc, 
because every additional space is read as an empty field. If only the 
CSVReader would treat the input as below:

Device rReq_PS wReq_PS rKB_PS wKB_PS avgWaitMillis avgSvcMillis bandwUtilPct
md1 0.0 0.0 0.0 0.0 0.0 0.0 0
md10 0.0 8.0 0.0 75.8 0.0 7.5 5
md11 0.0 0.0 0.0 0.0 0.0 0.0 0
md20 0.0 8.0 0.0 75.8 0.0 6.7 5
md21 0.0 0.0 0.0 0.0 0.0 0.0 0
md30 0.0 0.0 0.0 0.0 0.0 0.0 0
md100 0.0 8.0 0.0 75.8 0.0 8.1 6
sd0 0.0 0.0 0.0 0.0 0.0 0.0 0
sd1 0.0 8.0 0.0 75.8 0.0 6.6 5
sd2 0.0 0.0 0.0 0.0 0.0 0.0 0
sd3 0.0 8.0 0.0 75.8 0.0 7.4 5

Then it would go well. I know that a ReplaceText processor could easily do the 
same as a preceding step, but this is not always possible (for example with 
no-input processors like TCPRecordReader), and I also believe fewer processors 
is better.
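
To illustrate the requested behaviour outside NiFi, here is a minimal Python 
sketch. It is not NiFi code and not a proposed implementation: the helper 
names collapse_delimiters and read_rows are hypothetical, the sample is 
shortened to the first three columns of the data above, and quoting is 
ignored (a quoted field containing repeated delimiters would be altered too).

```python
import csv
import io
import re


def collapse_delimiters(line: str, delimiter: str = " ") -> str:
    """Replace each run of consecutive delimiters with a single delimiter.

    Hypothetical helper; assumes a single-character delimiter and no quoting.
    Leading and trailing delimiters are dropped as well.
    """
    pattern = f"(?:{re.escape(delimiter)})+"
    return re.sub(pattern, delimiter, line.strip(delimiter))


def read_rows(text: str, delimiter: str = " ") -> list:
    """Parse text as delimited records after collapsing delimiter runs."""
    lines = (
        collapse_delimiters(line, delimiter)
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    )
    return list(csv.reader(lines, delimiter=delimiter))


# Shortened sample: first three columns of the aligned data in this issue.
SAMPLE = (
    "Device          rReq_PS      wReq_PS\n"
    "md1                 0.0          0.0\n"
    "md10                0.0          8.0\n"
)

# Naive parsing: every extra space yields an empty field.
naive = list(csv.reader(io.StringIO(SAMPLE), delimiter=" "))

# Parsing with consecutive delimiters treated as one.
collapsed = read_rows(SAMPLE)

print(len(naive[1]), "fields per row without collapsing")
print(collapsed)
```

The naive parse yields far more than three fields per row, while the 
collapsed parse yields exactly the three intended values.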

  was:
It would be really great to have an additional property for the CSVReader 
controller service to treat multiple consecutive delimiter occurrences as only 
one. This is something you can do in Excel, for instance.

There are many CSV-like formats that have multiple delimiters following each 
other, for instance to aid in aligning columns:

Device          rReq_PS      wReq_PS        rKB_PS        wKB_PS  avgWaitMillis   avgSvcMillis   bandwUtilPct
md1                 0.0          0.0           0.0           0.0            0.0            0.0              0
md10                0.0          8.0           0.0          75.8            0.0            7.5              5
md11                0.0          0.0           0.0           0.0            0.0            0.0              0
md20                0.0          8.0           0.0          75.8            0.0            6.7              5
md21                0.0          0.0           0.0           0.0            0.0            0.0              0
md30                0.0          0.0           0.0           0.0            0.0            0.0              0
md100               0.0          8.0           0.0          75.8            0.0            8.1              6
sd0                 0.0          0.0           0.0           0.0            0.0            0.0              0
sd1                 0.0          8.0           0.0          75.8            0.0            6.6              5
sd2                 0.0          0.0           0.0           0.0            0.0            0.0              0
sd3                 0.0          8.0           0.0          75.8            0.0            7.4              5

Running the CSVReader with " " as delimiter on the above leads to havoc, 
because every additional space is read as an empty field. If only the 
CSVReader would treat the input as below:

Device rReq_PS wReq_PS rKB_PS wKB_PS avgWaitMillis avgSvcMillis bandwUtilPct
md1 0.0 0.0 0.0 0.0 0.0 0.0 0
md10 0.0 8.0 0.0 75.8 0.0 7.5 5
md11 0.0 0.0 0.0 0.0 0.0 0.0 0
md20 0.0 8.0 0.0 75.8 0.0 6.7 5
md21 0.0 0.0 0.0 0.0 0.0 0.0 0
md30 0.0 0.0 0.0 0.0 0.0 0.0 0
md100 0.0 8.0 0.0 75.8 0.0 8.1 6
sd0 0.0 0.0 0.0 0.0 0.0 0.0 0
sd1 0.0 8.0 0.0 75.8 0.0 6.6 5
sd2 0.0 0.0 0.0 0.0 0.0 0.0 0
sd3 0.0 8.0 0.0 75.8 0.0 7.4 5

Then it would go well. I know that a ReplaceText processor could easily do the 
same as a preceding step, but this is not always possible (for example with 
no-input processors like TCPRecordReader), and I also believe fewer processors 
is better.


> Add property to CSVReader to treat multiple delimiters as 1
> -----------------------------------------------------------
>
>                 Key: NIFI-7946
>                 URL: https://issues.apache.org/jira/browse/NIFI-7946
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.12.1
>            Reporter: Jasper Knulst
>            Assignee: Jasper Knulst
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
