Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread sebb
On Tue, 20 Jun 2023 at 12:39, Gary Gregory wrote: > > Hi All, > > This thread is a follow-up to > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258 > > Bruno says: > "With Pandas it automatically deduplicates the column names. Maybe > that's a feature that we could have in

RE: Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread Seth Falco
I don't have a strong enough opinion to conclude what's best. Giving it more thought, I think the interface approach I proposed is overcomplicated tbh. I can't imagine needing another duplicate header mode after this. However, I could imagine situations where we define

Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread Gary Gregory
Well, maybe we should not have a postfix string method; that assumes a lot. A default implementation of a function to convert all header names sounds better. Gary On Wed, Jun 21, 2023, 09:11 Gary Gregory wrote: > So it is starting to sound like we need either to add to CSVFormat: > > -

Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread Gary Gregory
So it is starting to sound like we need either to add to CSVFormat: - "duplicate header postfix string", or - deprecate duplicate header mode in favor of a duplicate header strategy which holds a duplicate header mode plus a duplicate header postfix string and some functional interface for custom

Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread David Dellsperger
I've always had a big concern with this kind of behavior, because what happens if the "new column" already exists but later in the header? It seems like Python/pandas deals with this by incrementing AGAIN, so they read the header and THEN decide what to do with the values for duplicates (make
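
To make the collision concern above concrete, here is a minimal sketch of a "read the whole header first, then rename" approach. It is an illustration only: `HeaderDeduplicator`, `deduplicate`, and the `separator` parameter are hypothetical names and are not part of Commons CSV, and the suffixing scheme (name, separator, counter) is an assumption modelled loosely on the pandas behaviour described in this thread, not a copy of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Commons CSV. */
public class HeaderDeduplicator {

    /**
     * Renames repeated header names by appending separator + counter.
     * Every original name is reserved up front, so a generated name can
     * never collide with a column that only appears later in the header.
     */
    static List<String> deduplicate(List<String> headers, String separator) {
        // Names that are unavailable: all originals plus generated names.
        Set<String> taken = new HashSet<>(headers);
        // Names already emitted for a first occurrence.
        Set<String> seen = new HashSet<>();
        // Next counter to try for each repeated base name.
        Map<String, Integer> nextSuffix = new HashMap<>();
        List<String> result = new ArrayList<>(headers.size());

        for (String name : headers) {
            if (seen.add(name)) {
                // First occurrence keeps its original name.
                result.add(name);
                continue;
            }
            int n = nextSuffix.getOrDefault(name, 1);
            String candidate = name + separator + n;
            // Increment again while the candidate is already in use.
            while (taken.contains(candidate)) {
                n++;
                candidate = name + separator + n;
            }
            nextSuffix.put(name, n + 1);
            taken.add(candidate);
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // "A.1" already exists later in the header, so the second "A"
        // skips it and becomes "A.2".
        System.out.println(deduplicate(Arrays.asList("A", "A", "A.1"), "."));
        // prints [A, A.2, A.1]
    }
}
```

Reserving all original names before renaming means an existing column is never renamed to make room for a generated one; the cost is that the suffix on a duplicate may skip numbers.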

Re: [CSV] Strategies to handle duplicate headers

2023-06-20 Thread Bruno Kinoshita
Hi, > However, I could imagine situations where we define > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our > normalization strategy. For example, dots in the headers breaks ingesting > the data in a third-party system. An interface could resolve this, but I > guess in such

RE: [CSV] Strategies to handle duplicate headers

2023-06-20 Thread Seth Falco
I don't have a strong enough opinion to conclude what's best. Giving it more thought, I think the interface approach I proposed is overcomplicated tbh. I can't imagine needing another duplicate header mode after this. However, I could imagine situations where we define

Re: [CSV] Strategies to handle duplicate headers

2023-06-20 Thread Gary Gregory
That's clever. So we could implement a new enum value DuplicateHeaderMode.DEDUPLICATE... Gary On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita wrote: > Hi, > > Bruno says: > > "With Pandas it automatically deduplicates the column names. Maybe > > that's a feature that we could have in Commons CSV

Re: [CSV] Strategies to handle duplicate headers

2023-06-20 Thread Bruno Kinoshita
Hi, Bruno says: > "With Pandas it automatically deduplicates the column names. Maybe > that's a feature that we could have in Commons CSV too?" > > What does that mean and actually do? Say I have column A with row 1 > value of "X" and 2nd column A with row 1 value of 2. What do I get > when I ask

[CSV] Strategies to handle duplicate headers

2023-06-20 Thread Gary Gregory
Hi All, This thread is a follow-up to https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258 Bruno says: "With Pandas it automatically deduplicates the column names. Maybe that's a feature that we could have in Commons CSV too?" What does that mean and actually do? Say I have