Hans-Jürgen,
I figured as much. I wonder if we can come up with an XSD-compliant regex for 
this purpose? It may not give us a full-featured CSV parser, but it would 
handle reasonably uniform cases.
Joe

Sent from my iPhone

On Sun, Sep 11, 2016 at 3:39 PM -0400, "Hans-Juergen Rennau" <hren...@yahoo.de> 
wrote:

Joe, concerning your regex, I would complain, too! Already the first two 
characters `(?` render the expression invalid:

(1) An unescaped ? is an occurrence indicator, making the preceding entity optional.
(2) An unescaped ( is used for grouping; it does not represent anything by itself.

=> There is no entity preceding the ? which the ? could make optional => error.

Please keep in mind that the regex flavor supported by XPath is the regex 
flavor defined by the XSD spec. There are a few constructs used in Perl & Co 
which are not defined in XPath regex.
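To illustrate the difference (a sketch, not an exhaustive list):

```
(: Perl-style constructs such as non-capturing groups "(?:...)",
   lookbehind "(?<=...)" and lookahead "(?=...)" are NOT part of the
   XSD/XPath regex flavor and raise FORX0002 in a conformant processor.

   What IS available: capturing groups, character classes, quantifiers,
   and - as XPath additions to the XSD flavor - the anchors ^ and $,
   back-references such as \1, and reluctant quantifiers such as +? :)
fn:matches('a,b', '^([^,]+),([^,]+)$')  (: true :)
```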

As for the CSV implementation, I came to realize my error: the BaseX 
implementation *is* Java code, not XQuery code - the .xqm module just 
contains the function signature, marked "external".

Cheers,
Hans
 

    On Sunday, 11 September 2016 at 21:27, Joe Wicentowski <joe...@gmail.com> wrote:

 Thanks for your replies and interest, Hans-Jürgen, Marc, Vincent, and 
Christian.

The other day, short of a comprehensive solution, I went in search of
a regex that would handle quoted values that contain commas that
shouldn't serve as delimiters.  I found one that worked in eXist but
not in BaseX.

Source for the regex: http://stackoverflow.com/a/13259681/659732

The query:

```
xquery version "3.1";

let $csv := 'Author,Title,ISBN,Binding,Year Published
Jeannette Walls,The Glass Castle,074324754X,Paperback,2006
James Surowiecki,The Wisdom of Crowds,9780385503860,Paperback,2005
Lawrence Lessig,The Future of Ideas,9780375505782,Paperback,2002
"Larry Bossidy, Ram Charan, Charles
Burck",Execution,9780609610572,Hardcover,2002
Kurt Vonnegut,Slaughterhouse-Five,9780791059258,Paperback,1999'
let $lines := tokenize($csv, '
')
let $header-row := fn:head($lines)
let $body-rows := fn:tail($lines)
let $headers := fn:tokenize($header-row, ",") ! fn:replace(., " ", "")
for $row in $body-rows
let $cells := fn:analyze-string($row,
'(?:\s*(?:\"([^\"]*)\"|([^,]+))\s*,?|(?<=,)(),?)+?')//fn:group
return
    element Book {
      for $cell at $count in $cells
      return element {$headers[$count]} {$cell/string()}
    }
```

It produces the desired results:

<Book>
    <Author>Jeannette Walls</Author>
    <Title>The Glass Castle</Title>
    <ISBN>074324754X</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2006</YearPublished>
</Book>
<Book>
    <Author>James Surowiecki</Author>
    <Title>The Wisdom of Crowds</Title>
    <ISBN>9780385503860</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2005</YearPublished>
</Book>
<Book>
    <Author>Lawrence Lessig</Author>
    <Title>The Future of Ideas</Title>
    <ISBN>9780375505782</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Larry Bossidy, Ram Charan, Charles Burck</Author>
    <Title>Execution</Title>
    <ISBN>9780609610572</ISBN>
    <Binding>Hardcover</Binding>
    <YearPublished>2002</YearPublished>
</Book>
<Book>
    <Author>Kurt Vonnegut</Author>
    <Title>Slaughterhouse-Five</Title>
    <ISBN>9780791059258</ISBN>
    <Binding>Paperback</Binding>
    <YearPublished>1999</YearPublished>
</Book>

Unfortunately BaseX complains about the regex, with the following error:

Stopped at /Users/joe/file, 9/32: [FORX0002] Invalid regular
expression: (?:\s(?:\"([^\"])\"|([^,]+))\s*,?|(?<=,)(),?)+?.

Without a column location, I'm unable to tell where the problem is.
Is there something used in this expression that BaseX doesn't support?
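For what it's worth, the `(?: ... )` non-capturing groups and the `(?<=,)` 
lookbehind in that expression are Perl-style constructs outside the XSD/XPath 
regex flavor, which would explain the FORX0002. An XSD-compatible regex can 
still handle the simple quoted-field case, at the cost of skipping empty 
fields and ignoring escaped quotes - the sketch below is an illustrative 
assumption, not a full CSV grammar:

```
xquery version "3.1";

(: Sketch: extract CSV fields with an XSD-compatible regex.
   "[^"]*"  matches a quoted field (which may contain commas);
   [^",]+   matches an unquoted field.
   Caveats: empty fields are dropped, "" escapes are not handled. :)
let $row := '"Larry Bossidy, Ram Charan, Charles Burck",Execution,9780609610572,Hardcover,2002'
return
  fn:analyze-string($row, '"[^"]*"|[^",]+')//fn:match
  ! fn:replace(., '^"|"$', '')
```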

On the topic of the potential memory pitfalls of a pure XQuery
solution for our hypothetical EXPath library, I think the primary
problem is that the entire CSV has to be loaded into memory.  I wonder
if implementations could use the new `fn:unparsed-text-lines()`
function from XQuery 3.0 to stream the CSV through XQuery without
requiring the entire thing to be in memory?  Or are we basically
setting ourselves up for the EXPath solution being a wrapper around an
external library written in a lower-level language?
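
A line-at-a-time sketch along those lines might look as follows (whether it 
actually streams depends on the implementation; the file name and the naive 
comma split are assumptions, so quoted fields and records spanning multiple 
lines are not handled):

```
xquery version "3.1";

(: Sketch: process CSV line by line via fn:unparsed-text-lines().
   Caveat: a naive tokenize on "," breaks on quoted commas, and a
   record that spans lines (as in the example above) is split. :)
let $lines   := fn:unparsed-text-lines('books.csv')
let $headers := fn:tokenize(fn:head($lines), ',') ! fn:replace(., ' ', '')
for $line in fn:tail($lines)
return
  element Book {
    for $cell at $i in fn:tokenize($line, ',')
    return element { $headers[$i] } { $cell }
  }
```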

Joe

On Sun, Sep 11, 2016 at 4:53 AM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Hi Joe,
>
> Thanks for your mail. You are completely right, using an array would
> be the natural choice with csv:parse. It’s mostly due to backward
> compatibility that we didn’t update the function.
>
> @All: I’m pretty sure that all of us would like having an EXPath spec
> for parsing CSV data. We still need one volunteer to make it happen ;)
> Anyone out there?
>
> Cheers
> Christian
>
>
> On Thu, Sep 8, 2016 at 6:13 AM, Joe Wicentowski <joe...@gmail.com> wrote:
>> Dear BaseX developers,
>>
>> I noticed in example 3 under
>> http://docs.basex.org/wiki/CSV_Module#Examples that csv:parse() with
>> option { 'format': 'map' } returns a map of maps, with hardcoded row
>> numbers:
>>
>> map {
>>     1: map {
>>         "City": "Newton",
>>         "Name": "John"
>>     },
>>     2: map {
>>         "City": "Oldtown",
>>         "Name": "Jack"
>>     }
>> }
>>
>> Because maps are unordered, representing something ordered like CSV
>> rows requires hardcoded row numbers for reassembling the rows in
>> document order.  I assume this was a necessary approach when the
>> module was developed in the map-only world of XQuery 3.0.
>> Now that 3.1 supports arrays, might an array of maps be a closer fit
>> for CSV parsing?
>>
>> array {
>>     map {
>>         "City": "Newton",
>>         "Name": "John"
>>     },
>>     map {
>>         "City": "Oldtown",
>>         "Name": "Jack"
>>     }
>> }
>>
>> I'm also curious, do you know of any efforts to create an EXPath spec
>> for CSV?  Putting spec and CSV in the same sentence is dangerous,
>> since CSV is a notoriously under-specified format: "The CSV file
>> format is not standardized" (see
>> https://en.wikipedia.org/wiki/Comma-separated_values).  But perhaps
>> there is a common enough need for CSV parsing that such a spec would
>> benefit the community?  I thought I'd start by asking here, since
>> BaseX's CSV module seems to be the most developed (or only?) one in
>> XQuery.
>>
>> Then there's the question of how to approach implementations of such a
>> spec.  While XQuery is probably capable of parsing and serializing
>> small enough CSV, CSVs do get large and naive processing with XQuery
>> would tend to run into memory issues (as I found with xqjson).  This
>> means implementations would tend to write in a lower-level language.
>> eXist, for example, uses Jackson for fn:parse-json().  I see Jackson
>> has a CSV extension too:
>> https://github.com/FasterXML/jackson-dataformat-csv.  Any thoughts on
>> the suitability of XQuery for the task?
>>
>> Joe
