Vladimir Alexiev created JENA-1505:
--------------------------------------

             Summary: add function apf:strIndexSplit
                 Key: JENA-1505
                 URL: https://issues.apache.org/jira/browse/JENA-1505
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ
            Reporter: Vladimir Alexiev


We use Tarql to convert some company CSV data to RDF.
We had cases of multiple values in a field (e.g. aliases), which we handle with 
apf:strSplit.

But now we've hit another case: several multi-value fields arranged in parallel 
arrays.
Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 
newline-separated parallel arrays that describe the participant companies: 
?coIds, ?coNames, ?coIndustries.
If we use several apf:strSplit in one query, that will cause a Cartesian 
product, and mix up all company ids, names, industries together.
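
To make the mix-up concrete, here is a minimal Python sketch (not Tarql or ARQ code). The field contents are made up for illustration. Splitting two fields independently behaves like a cross join of the pieces, whereas what we want is the pieces paired by position:

```python
from itertools import product

# Hypothetical contents of two parallel multi-value fields in one CSV row.
co_ids = "c1\nc2\nc3"
co_names = "Acme\nBolt\nCore"

ids = co_ids.split("\n")
names = co_names.split("\n")

# Two independent splits in one query behave like a cross join:
# every id is paired with every name.
mixed = list(product(ids, names))

# What we actually want: pieces paired by their position in the field.
aligned = list(zip(ids, names))

print(len(mixed))  # 9 pairs, most of them wrong
print(aligned)     # [('c1', 'Acme'), ('c2', 'Bolt'), ('c3', 'Core')]
```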

Tarql allows multiple CONSTRUCT queries in one script, and "the triples 
generated by previous CONSTRUCT clauses can be queried in subsequent WHERE 
clauses to retrieve additional data".

So my idea is to split each column in a separate CONSTRUCT, attach the values 
to temporary nodes, and reassemble them in a final CONSTRUCT.
But we can't do this with apf:strSplit, since it loses the index (ordering) of 
the individual values.
We need a new Jena ARQ function, e.g. with a signature like this, where ? 
indicates unbound and $ indicates bound:
{noformat}
(?index ?value) apf:strIndexSplit ($string $separator)
Splits $string on regex $separator and produces a number of binding pairs
where ?index is bound to a sequential number (starting from 1)
and ?value is bound to the consecutive string part that is split off.
{noformat}
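
The intended behaviour can be sketched in a few lines of Python. This only models the semantics described above; the name str_index_split is hypothetical and nothing here is Jena API:

```python
import re

def str_index_split(string, separator):
    """Split string on the regex separator and return (index, value) pairs.

    Indices start from 1, mirroring the proposed ?index binding.
    """
    return list(enumerate(re.split(separator, string), start=1))

# Hypothetical newline-separated field value.
pairs = str_index_split("c1\nc2\nc3", r"\n")
print(pairs)  # [(1, 'c1'), (2, 'c2'), (3, 'c3')]
```

Each pair corresponds to one solution binding (?index, ?value), so the position of every value survives the split.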

Then we could hack the problem with something like this:
{noformat}
construct { # get first multiValue field
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
}

construct { # get second multiValue field
 ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
}

construct { # get third multiValue field
 ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
}

construct { # make JV node
 ?JV ex:id ?jvId; ex:name ?jvName.
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
}

construct { # make Company node and relation
 ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
 ?JV ex:hasParticipant ?CO
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
 optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
 optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
 bind(uri(concat("company/",?coId)) as ?CO)
 bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
}
{noformat}
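
The reassembly in the last CONSTRUCT is a join on the shared index. A rough Python model of that step, with made-up field values and a hypothetical helper indexed(), might look like this (again only a sketch of the idea, not Tarql or Jena code):

```python
import re

def indexed(field):
    # Map index -> value, indices starting from 1 (models the proposed split).
    return dict(enumerate(re.split(r"\n", field), start=1))

# Hypothetical row: the industries field has one value fewer than the others.
co_ids = indexed("c1\nc2\nc3")
co_names = indexed("Acme\nBolt\nCore")
co_industries = indexed("steel\nlogistics")

# The ids drive the join; names and industries are looked up by the same
# index and may be missing, like the OPTIONAL patterns in the query.
companies = [
    {"id": value, "name": co_names.get(i), "industry": co_industries.get(i)}
    for i, value in sorted(co_ids.items())
]

for company in companies:
    print(company)
```

Because every lookup uses the same index, the third company keeps its own name and simply has no industry, instead of being paired with values from other companies.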

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
