Vladimir Alexiev created JENA-1505:
--------------------------------------

             Summary: add function apf:strIndexSplit
                 Key: JENA-1505
                 URL: https://issues.apache.org/jira/browse/JENA-1505
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ
            Reporter: Vladimir Alexiev

We use Tarql to convert some company CSV data to RDF. We had cases of multiple values in a field (e.g. aliases) that we handle with apf:strSplit.

But now we've hit another case: several multi-value fields arranged in parallel arrays. Each CSV row is a Joint Venture (?jvId, ?jvName), and there are 3 newline-separated parallel arrays that describe the participant companies: ?coIds, ?coNames, ?coIndustries.

If we use several apf:strSplit calls in one query, that will cause a Cartesian product and mix up all company ids, names and industries.

Tarql allows multiple CONSTRUCT queries in one script, and "the triples generated by previous CONSTRUCT clauses can be queried in subsequent WHERE clauses to retrieve additional data". So my idea is to split each column in a separate CONSTRUCT, attach the values to temporary nodes, and reassemble them in a final CONSTRUCT. But we can't do this with apf:strSplit, since it loses the index (ordering) of the individual values.

We need a new Jena ARQ function, e.g. with a signature like this, where ? indicates unbound and $ indicates bound:

{noformat}
(?index ?value) apf:strIndexSplit ($string $separator)

Splits $string on regex $separator and produces a number of binding pairs,
where ?index is bound to a sequential number (starting from 1)
and ?value is bound to the consecutive string part that is split off.
{noformat}

Then we could hack the problem with something like this:

{noformat}
construct { # get first multiValue field
  ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
  bind(uri(concat("uri:tmp:", str(?ROWNUM))) as ?ROW)
  (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
}

construct { # get second multiValue field
  ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
  bind(uri(concat("uri:tmp:", str(?ROWNUM))) as ?ROW)
  (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
}

construct { # get third multiValue field
  ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
  bind(uri(concat("uri:tmp:", str(?ROWNUM))) as ?ROW)
  (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
}

construct { # make JV node
  ?JV ex:id ?jvId; ex:name ?jvName.
} where {
  bind(uri(concat("jv/", ?jvId)) as ?JV)
}

construct { # make Company node and relation
  ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
  ?JV ex:hasParticipant ?CO
} where {
  bind(uri(concat("jv/", ?jvId)) as ?JV)
  bind(uri(concat("uri:tmp:", str(?ROWNUM))) as ?ROW)
  # join the three temporary arrays on the shared ?INDEX
  ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
  optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
  optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
  bind(uri(concat("company/", ?coId)) as ?CO)
  bind(uri(concat("industry/", ?coIndustry)) as ?INDUSTRY)
}
{noformat}
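
To make the intended semantics concrete outside SPARQL, here is a minimal Python sketch. It is only an illustration under assumptions: str_index_split is a hypothetical model of the proposed apf:strIndexSplit (not an existing Jena or Tarql API), and the sample company data is made up. It shows why plain splitting mixes the parallel arrays and how joining on the index keeps them aligned.

```python
import re
from itertools import product


def str_index_split(string, separator):
    """Hypothetical model of the proposed apf:strIndexSplit.

    Splits `string` on the regex `separator` and returns a list of
    (index, value) pairs, with indices starting from 1.
    """
    return list(enumerate(re.split(separator, string), start=1))


# Made-up sample row: two parallel, newline-separated arrays.
co_ids = "c1\nc2\nc3"
co_names = "Acme\nBolt\nCrux"

# Splitting both fields without an index behaves like a Cartesian product:
# every id is paired with every name.
cartesian = list(product(re.split(r"\n", co_ids), re.split(r"\n", co_names)))

# Keeping the index lets us join the arrays position by position.
# dict.get returns None for a missing index, roughly like OPTIONAL.
names_by_index = dict(str_index_split(co_names, r"\n"))
aligned = [
    (co_id, names_by_index.get(index))
    for index, co_id in str_index_split(co_ids, r"\n")
]

print(len(cartesian))  # number of mixed-up (id, name) pairs
print(aligned)         # index-aligned (id, name) pairs
```

With three values per field, the unindexed split yields nine id/name combinations, while the index join yields one pair per company. The final CONSTRUCT above relies on the same join through the shared ?INDEX variable.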