SethFalco commented on PR #309:
URL: https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258

   I'm up for anything, but I'll at least give my personal thoughts on the 
methods described above. Note… I might be overcomplicated things tbh…
   
   > The stdlib csv module has the same behavior documented here, displaying 
the last occurrence.
   
   I feel just doing the last occurrence is very likely to lead to scenarios 
where:
   1. User implements something using this library
   2. They run into unexpected results due to duplicate header names
   3. They code in a workaround to handle this situation as they feel best fits 
in with their application, likely either:
       * Disallow duplicate header names
       * Use a different method to iterate through columns
       * Deduplicate the headings themselves before providing it to Commons CSV.
   
   I feel like throwing an exception makes more sense, so developers can skip 
to step 3 sooner. :thinking: 
   
   > With Pandas it automatically deduplicates the column names
   
   This actually crossed my mind, but I was hesitant to modify the data 
provided by users, at least it's probably best not to do this implicitly imo.
   
   I'm also wary that this would screw up existing projects that might depend 
on/prefer simply allowing/disallowing duplicates. (i.e. want to allow 
duplicates and handle things through indexes / iteration, so this didn't cause 
a problem for them and therefore don't need the headers deduplicated)
   
   If we want to put this in the library, I'd be tempted to do one of the 
following.
   
   ### `HeaderStrategy` Interface
   
   Contains two functions:
   * `#normalizeHeaders(headings)` - With given header names, output a new list 
that fits with whatever the strategy is going for.
   * `#get(record, header)` - Fetch value(s) based on given column name.
   
   The `#normalizeHeaders` function would be used during parsing. 
Implementations may choose not to do any normalization at all. (i.e. they don't 
want to rename headings)
   
   This would allow things like deduplicating headings. I think it's best to 
separate normalization from fetching as implementations like this may want to 
display a warning in `#get` if the user tries to get one of the original header 
names before normalization. (i.e. `A,A` → `A.1,A.2` but user does `#get("A")`. 
People may be confused why it didn't work when it was indeed one of the 
original headings.)
   
   The `#get` method would be used to get the value from the record. A strategy 
might be the:
   * last occurrence
   * the last occurrence with a non-empty value
   * all values joined with a comma (`,`) or semicolon (`;`)
   
   We could define one based on what we think the default behavior. This should 
reduce maintenance burden if users have different ideas on how this should be 
handled (so we don't have to support all use-cases ourselves). This enables 
consumers of the lib to override it with whatever they want.
   
   ### Repurpose `DuplicateHeaderMode`
   
   We could introduce modes like `Deduplicate`. Unlike the above, this doesn't 
let users implement and provide their own strategies to the builder, they'd be 
stuck with our enum, which also means we shoulder the responsibility to 
implement whatever use-cases others have.)
   
   ### Something Else
   
   Both solutions above might over-complicate things or increase maintenance 
effort for little gain. :thinking: Users already have the parser, from there 
they could technically do any normalization themselves already so long as we 
document one way or another what happens with duplicate headers if they do 
choose to leave them in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to