[jira] [Commented] (JENA-1505) add function apf:strIndexSplit

Andy Seaborne (JIRA) Thu, 22 Mar 2018 03:40:21 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409356#comment-16409356
 ]


Andy Seaborne commented on JENA-1505:
-------------------------------------

`apf:strIndexSplit` seems like a good idea.  Do you want to put in PR for it?

> add function apf:strIndexSplit
> ------------------------------
>
>                 Key: JENA-1505
>                 URL: https://issues.apache.org/jira/browse/JENA-1505
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Vladimir Alexiev
>            Priority: Major
>
> We use Tarql to convert some company CSV data to RDF.
>  We had cases of multiple values in a field (eg aliases) that we handle with 
> apf:strSplit.
> But now we've hit another case: several multi-value fields arranged in 
> parallel arrays.
>  Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 
> newline-separated parallel arrays that describe the participant companies: 
> ?coIds, ?coNames, ?coIndustries.
>  If we use several apf:strSplit in one query, that will cause a Cartesian 
> product, and mix up all company ids, names, industries together.
> Tarql allows multiple CONSTRUCT queries in one script, and "the triples 
> generated by previous CONSTRUCT clauses can be queries in subsequent WHERE 
> clauses to retrieve additional data". So my idea is to split each column in a 
> separate CONSTRUCT, attach the values to temporary nodes, and reassemble them 
> in a final CONSTRUCT.
> But we can't do this with apf:strSplit, since it loses the index (ordering) 
> of the individual values.
>  We need a new Jena ARQ function, eg with a signature like this where ? 
> indicates unbound and $indicates bound:
> {noformat}
> (?index ?value) apf:strIndexSplit ($string $separator)
> Splits $string on regex $separator and produces a number of binding pairs
> where ?index is bound to a sequential number (starting from 1)
> and ?value is bound to the consecutive string part that is split off.
> {noformat}
> Then we could hack the problem with something like this:
> {noformat}
> construct { # get first multiValue field
>  ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri("urn:tmp:",?ROWNUM) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coIds, "\\n")
> }
> construct { # get second multiValue field
>  ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri("urn:tmp:",?ROWNUM) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coNames, "\\n")
> }
> construct { # get third multiValue field
>  ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>  bind(uri("urn:tmp:",?ROWNUM) as ?ROW)
>  (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries, "\\n")
> }
> construct { # make JV node
>  ?JV ex:id ?jvId; ex:name ?jvName.
> } where {
>  bind(uri(concat("jv/",?jvId) as ?JV))
> }
> construct { # make Company node and relation
>  ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
>  ?JV ex:hasParticipant ?CO
> } where {
>  bind(uri(concat("jv/",?jvId) as ?JV))
>  bind(uri(concat("urn:tmp:",?ROWNUM) as ?ROW))
>            ?ROW tmp:coIds        [tmp:index ?INDEX; tmp:value ?coId]
>  optional {?ROW tmp:coNames      [tmp:index ?INDEX; tmp:value ?coName]}
>  optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
>  bind(uri(concat("company/",?coId) as ?CO)
>  bind(uri(concat("industry/",?coIndustry) as ?INDUSTRY)
> }
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (JENA-1505) add function apf:strIndexSplit

Reply via email to