[
https://issues.apache.org/jira/browse/JENA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Seaborne updated JENA-1505:
--------------------------------
Labels: First (was: )
> add function apf:strIndexSplit
> ------------------------------
>
> Key: JENA-1505
> URL: https://issues.apache.org/jira/browse/JENA-1505
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Reporter: Vladimir Alexiev
> Priority: Major
> Labels: First
>
> We use Tarql to convert some company CSV data to RDF.
> We had cases of multiple values in a field (e.g. aliases) that we handle with
> apf:strSplit.
> But now we've hit another case: several multi-value fields arranged in
> parallel arrays.
> Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3
> newline-separated parallel arrays that describe the participant companies:
> ?coIds, ?coNames, ?coIndustries.
> If we use several apf:strSplit in one query, that will cause a Cartesian
> product, and mix up all company ids, names, industries together.
> Tarql allows multiple CONSTRUCT queries in one script, and "the triples
> generated by previous CONSTRUCT clauses can be queried in subsequent WHERE
> clauses to retrieve additional data". So my idea is to split each column in a
> separate CONSTRUCT, attach the values to temporary nodes, and reassemble them
> in a final CONSTRUCT.
> But we can't do this with apf:strSplit, since it loses the index (ordering)
> of the individual values.
> We need a new Jena ARQ function, e.g. with a signature like this, where ?
> indicates unbound and $ indicates bound:
> {noformat}
> (?index ?value) apf:strIndexSplit ($string $separator)
> Splits $string on regex $separator and produces a number of binding pairs
> where ?index is bound to a sequential number (starting from 1)
> and ?value is bound to the consecutive string part that is split off.
> {noformat}
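
To pin down the intended semantics, here is a minimal Python sketch of what the proposed function would return. The name `str_index_split` is hypothetical, and Python's `re` dialect stands in for whatever regex flavour ARQ would actually use, so treat it as an illustration of the contract rather than an implementation.

```python
import re
from typing import Iterator, Tuple


def str_index_split(string: str, separator: str) -> Iterator[Tuple[int, str]]:
    """Yield (index, value) pairs for each part of `string` split on the
    regex `separator`; the index starts from 1, as in the proposal."""
    for index, value in enumerate(re.split(separator, string), start=1):
        yield index, value


# Example: a newline-separated field with three values (sample data).
pairs = list(str_index_split("c1\nc2\nc3", r"\n"))
print(pairs)
```

Each pair corresponds to one solution binding of (?index ?value), which is what keeps the position available for a later join.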
> Then we could hack the problem with something like this:
> {noformat}
> construct { # get first multi-value field
>   ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:", str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
> }
> construct { # get second multi-value field
>   ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:", str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
> }
> construct { # get third multi-value field
>   ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:", str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
> }
> construct { # make JV node
>   ?JV ex:id ?jvId; ex:name ?jvName.
> } where {
>   bind(uri(concat("jv/", ?jvId)) as ?JV)
> }
> construct { # make Company node and relation
>   ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
>   ?JV ex:hasParticipant ?CO
> } where {
>   bind(uri(concat("jv/", ?jvId)) as ?JV)
>   bind(uri(concat("urn:tmp:", str(?ROWNUM))) as ?ROW)
>   # join the three temporary nodes on the shared ?INDEX
>   ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
>   optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
>   optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
>   bind(uri(concat("company/", ?coId)) as ?CO)
>   bind(uri(concat("industry/", ?coIndustry)) as ?INDUSTRY)
> }
> {noformat}
>
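
For readers who want to see the reassembly step outside SPARQL, the following Python sketch mimics the final CONSTRUCT: each field is split into an index-to-value map, and names and industries are looked up by the same index, much like the OPTIONAL patterns. The helper names and the sample values are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional


def index_split(field: str, separator: str = "\n") -> Dict[int, str]:
    """Map each 1-based position to the value found at that position."""
    return {i: v for i, v in enumerate(field.split(separator), start=1)}


def companies(co_ids: str, co_names: str,
              co_industries: str) -> List[Dict[str, Optional[str]]]:
    """Reassemble parallel arrays by joining on the shared index.

    Ids drive the join; a missing name or industry becomes None, which
    plays the role of an unbound variable from an OPTIONAL pattern.
    """
    ids = index_split(co_ids)
    names = index_split(co_names)
    industries = index_split(co_industries)
    return [
        {"id": co_id, "name": names.get(i), "industry": industries.get(i)}
        for i, co_id in sorted(ids.items())
    ]


# Sample row: two companies, but only one industry value is present.
rows = companies("c1\nc2", "Acme\nGlobex", "mining")
print(rows)
```

Joining on the index keeps the result at one entry per company, instead of the Cartesian product that several independent splits would produce.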