[ https://issues.apache.org/jira/browse/JENA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409356#comment-16409356 ]
Andy Seaborne commented on JENA-1505:
-------------------------------------

`apf:strIndexSplit` seems like a good idea. Do you want to put in a PR for it?

> add function apf:strIndexSplit
> ------------------------------
>
>                 Key: JENA-1505
>                 URL: https://issues.apache.org/jira/browse/JENA-1505
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Vladimir Alexiev
>            Priority: Major
>
> We use Tarql to convert some company CSV data to RDF.
> We had cases of multiple values in a field (eg aliases) that we handle with
> apf:strSplit.
> But now we've hit another case: several multi-value fields arranged in
> parallel arrays.
> Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3
> newline-separated parallel arrays that describe the participant companies:
> ?coIds, ?coNames, ?coIndustries.
> If we use several apf:strSplit in one query, that will cause a Cartesian
> product, and mix up all company ids, names, industries together.
> Tarql allows multiple CONSTRUCT queries in one script, and "the triples
> generated by previous CONSTRUCT clauses can be queried in subsequent WHERE
> clauses to retrieve additional data". So my idea is to split each column in a
> separate CONSTRUCT, attach the values to temporary nodes, and reassemble them
> in a final CONSTRUCT.
> But we can't do this with apf:strSplit, since it loses the index (ordering)
> of the individual values.
> We need a new Jena ARQ function, eg with a signature like this, where ?
> indicates unbound and $ indicates bound:
> {noformat}
> (?index ?value) apf:strIndexSplit ($string $separator)
> Splits $string on regex $separator and produces a number of binding pairs
> where ?index is bound to a sequential number (starting from 1)
> and ?value is bound to the consecutive string part that is split off.
> {noformat}
> Then we could hack the problem with something like this:
> {noformat}
> construct { # get first multiValue field
>   ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
> }
> construct { # get second multiValue field
>   ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
> }
> construct { # get third multiValue field
>   ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
> } where {
>   bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>   (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
> }
> construct { # make JV node
>   ?JV ex:id ?jvId; ex:name ?jvName.
> } where {
>   bind(uri(concat("jv/",?jvId)) as ?JV)
> }
> construct { # make Company node and relation
>   ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
>   ?JV ex:hasParticipant ?CO
> } where {
>   bind(uri(concat("jv/",?jvId)) as ?JV)
>   bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
>   ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
>   optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
>   optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
>   bind(uri(concat("company/",?coId)) as ?CO)
>   bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
> }
> {noformat}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
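
For illustration only, the semantics requested in the issue can be modelled outside Jena. The Python sketch below is not ARQ code: `str_index_split` and `join_on_index` are hypothetical helper names, and the company ids, names and industries are made-up sample values. It shows why independent splits mix the parallel arrays together, and how a 1-based index lets the values be lined up again, with the ids column required and the other columns optional as in the final CONSTRUCT.

```python
import re
from itertools import product


def str_index_split(string, separator):
    """Hypothetical model of the proposed apf:strIndexSplit.

    Splits `string` on the regex `separator` and returns
    (index, value) pairs, with the index starting from 1.
    """
    return list(enumerate(re.split(separator, string), start=1))


def join_on_index(*columns, separator="\n"):
    """Align parallel multi-value fields by their split index.

    The first column drives the result; a value missing from a
    later column becomes None (like an unmatched OPTIONAL).
    """
    tables = [dict(str_index_split(col, separator)) for col in columns]
    return [tuple(t.get(i) for t in tables) for i in sorted(tables[0])]


# Made-up sample row with three newline-separated parallel arrays.
co_ids = "c1\nc2"
co_names = "Acme\nGlobex"
co_industries = "Mining\nEnergy"

# Splitting each field independently yields every combination,
# which is the Cartesian product the issue describes.
mixed = list(product(*(v.split("\n") for v in (co_ids, co_names, co_industries))))

# Joining on the shared index keeps each company's values together.
aligned = join_on_index(co_ids, co_names, co_industries)

print(len(mixed))
print(aligned)
```

With two companies, the independent splits produce eight mixed-up combinations, while the index join produces one row per company.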