[jira] [Commented] (CSV-131) Save positions of records to enable random access

Holger Stratmann (JIRA) Wed, 17 Sep 2014 06:05:07 -0700

    [ 
https://issues.apache.org/jira/browse/CSV-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14137184#comment-14137184
 ]


Holger Stratmann commented on CSV-131:
--------------------------------------

{quote}Parse this new CSV data but start counting characters as X and start 
counting records at Y{quote}
Yes, that is exactly the point.
{quote}Why not just say, skip to record R or skip to char position P?{quote}
Because you cannot skip to char position P (and much less to record R) without 
reading the entire stream (up to that position/record) - which is exactly what 
I am trying to avoid. Just as in the test case, I want to start reading at some 
position in the middle. Actually, setting the record number and character 
position is purely cosmetic: I want the returned records to be identical to the 
ones I read when reading the full stream...
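To illustrate "start reading at some position in the middle", here is a minimal sketch using only the JDK. It assumes a byte offset for the record was saved during an earlier scan; {{MidStreamRead}} and {{readLineAt}} are hypothetical names and not part of Commons CSV or the attached patches, and the file is assumed to be UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of Commons CSV. */
class MidStreamRead {

    /** Reads one line starting at a byte offset recorded during an earlier scan. */
    static String readLineAt(Path file, long byteOffset) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(byteOffset); // jumps directly; nothing before the offset is decoded
            // The channel shares the file pointer, so decoding starts at byteOffset.
            Reader reader = Channels.newReader(raf.getChannel(), "UTF-8");
            return new BufferedReader(reader).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("records", ".csv");
        // "\u00e4" takes two bytes in UTF-8, so record 2 starts at byte 5 but character 4.
        Files.write(file, "\u00e4,b\nc,d\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLineAt(file, 5)); // prints c,d
        Files.delete(file);
    }
}
```

A reader opened this way has no idea how many characters or records came before it, which is why the parser would need to be told the starting character position and record number from outside.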
I agree that the setters are not really nice. Calling them only makes sense 
before you start reading (i.e. directly after calling the constructor). I made 
setters because I wanted to make minimal changes. The positions might make more 
sense as additional parameters to the constructor ("Here is a reader and some 
information about it"). I just didn't want to make additional versions of each 
constructor, but when I take another look at it now, it would probably only 
really concern the one that takes a reader.
So we could make a constructor
{code}
public CSVParser(final Reader reader, final CSVFormat format,
        final int currentPosition, final int nextRecordNumber) throws IOException
{code}
and remove the setters (and have the current constructor just call 
this(reader, format, 0, 1)).
If you like this idea better, I can submit a new patch or you can modify it, 
whichever you prefer.
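
The constructor chaining could look roughly like the sketch below. {{OffsetAwareParser}} is a hypothetical stand-in, not the real CSVParser: it only splits on '\n' and ',' (no quoting) and assumes a single '\n' line terminator. It uses long rather than int for the position, on the assumption that large files may exceed the int range.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the parser: splits on '\n' and ',' only, no quoting. */
class OffsetAwareParser {

    /** A parsed record together with its number and starting character position. */
    static final class Rec {
        final long recordNumber;
        final long characterPosition;
        final String[] values;

        Rec(long recordNumber, long characterPosition, String[] values) {
            this.recordNumber = recordNumber;
            this.characterPosition = characterPosition;
            this.values = values;
        }
    }

    private final BufferedReader in;
    private long currentPosition;
    private long nextRecordNumber;

    /** Existing-style constructor: start of stream, first record is number 1. */
    OffsetAwareParser(Reader reader) {
        this(reader, 0, 1);
    }

    /** Proposed-style constructor: the reader is already positioned mid-stream. */
    OffsetAwareParser(Reader reader, long currentPosition, long nextRecordNumber) {
        this.in = new BufferedReader(reader);
        this.currentPosition = currentPosition;
        this.nextRecordNumber = nextRecordNumber;
    }

    /** Returns the next record, or null at end of input. */
    Rec next() {
        try {
            String line = in.readLine();
            if (line == null) {
                return null;
            }
            Rec rec = new Rec(nextRecordNumber++, currentPosition, line.split(",", -1));
            currentPosition += line.length() + 1; // assumes a single '\n' terminator
            return rec;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String data = "a,b\nc,d\ne,f\n";
        OffsetAwareParser full = new OffsetAwareParser(new StringReader(data));
        full.next();
        full.next();
        Rec third = full.next(); // record 3 starts at character 8

        // Resume from the saved position without re-reading records 1 and 2.
        OffsetAwareParser resumed = new OffsetAwareParser(
                new StringReader(data.substring((int) third.characterPosition)),
                third.characterPosition, third.recordNumber);
        Rec again = resumed.next();
        System.out.println(again.recordNumber + " @ " + again.characterPosition); // prints 3 @ 8
    }
}
```

With both values passed at construction time, a record read from the middle reports the same record number and position as it does in a full read, and there is no setter that could be called after parsing has started.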

> Save positions of records to enable random access
> -------------------------------------------------
>
>                 Key: CSV-131
>                 URL: https://issues.apache.org/jira/browse/CSV-131
>             Project: Commons CSV
>          Issue Type: Improvement
>          Components: Parser
>    Affects Versions: 1.1
>            Reporter: Holger Stratmann
>            Priority: Minor
>         Attachments: PositionTrackingFull_v101_20140910.patch, 
> PositionTrackingTest_20140907.patch, PositionTracking_20140907.patch, 
> ggregory-CSV-131-parser-and-record.diff
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It would be good to have {{CSVRecord}} save its position in the source stream.
> Reason: Knowing the position of the records would enable random access to 
> retrieve records from the source (after reading it once to build an index) if 
> the file is too large to be read into memory (or if we don't want to read the 
> full file to access a record in the middle).
> Additional info: I have created a "random access csv reader" and a "csv 
> viewer" (Swing) for arbitrarily large CSV files. It requires one additional 
> scan of the file to build an index (multi-byte charsets supported). The index 
> can be saved to a file so it only needs to be built once. Because the lexer 
> uses a BufferedReader, we need "internal information" to know where each 
> record starts.
> The change to "core" is minor: one field in {{CSVRecord}}s and some 
> associated methods to store the position.
> Patch will be attached.
> Code for random access (both UI and non-UI) will be proposed (and possibly 
> submitted) as a separate issue. It could also be an independent add-on but 
> requires this one little change to Commons CSV.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
