[jira] [Updated] (JENA-1505) add function apf:strIndexSplit

Vladimir Alexiev (JIRA) Sat, 10 Mar 2018 04:49:22 -0800

     [ 
https://issues.apache.org/jira/browse/JENA-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vladimir Alexiev updated JENA-1505:
-----------------------------------
    Description: 
We use Tarql to convert some company CSV data to RDF.
 We had cases of multiple values in a field (eg aliases) that we handle with 
apf:strSplit.

But now we've hit another case: several multi-value fields arranged in parallel 
arrays.
 Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 
newline-separated parallel arrays that describe the participant companies: 
?coIds, ?coNames, ?coIndustries.
 If we use several apf:strSplit calls in one query, the result is a Cartesian 
product that mixes up all company ids, names and industries.
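
To illustrate the problem outside of Tarql, here is a minimal Python sketch (not ARQ code; the sample values are invented for illustration). Splitting two fields independently pairs every id with every name, whereas what we want is pairing by position:

```python
from itertools import product

# Hypothetical sample row; the values are invented for illustration.
co_ids = "c1\nc2"
co_names = "Acme\nBolt"

ids = co_ids.split("\n")
names = co_names.split("\n")

# Two independent splits behave like a cross join:
# every id is combined with every name.
cross = list(product(ids, names))

# The intended result: values paired by their position in each array.
aligned = list(zip(ids, names))

print(cross)    # 4 combinations, 2 of them wrong
print(aligned)  # 2 correctly paired companies
```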

Tarql allows multiple CONSTRUCT queries in one script, and "the triples 
generated by previous CONSTRUCT clauses can be queried in subsequent WHERE 
clauses to retrieve additional data". So my idea is to split each column in a 
separate CONSTRUCT, attach the values to temporary nodes, and reassemble them 
in a final CONSTRUCT.

But we can't do this with apf:strSplit, since it loses the index (ordering) of 
the individual values.
 We need a new Jena ARQ function, e.g. with a signature like this, where ? 
indicates unbound and $ indicates bound:
{noformat}
(?index ?value) apf:strIndexSplit ($string $separator)
Splits $string on regex $separator and produces a number of binding pairs
where ?index is bound to a sequential number (starting from 1)
and ?value is bound to the consecutive string part that is split off.
{noformat}
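
The intended semantics can be modelled with a short Python sketch. This is only an illustration of the proposal, not Jena code: the name str_index_split is hypothetical, and the handling of empty parts (e.g. a trailing separator) is not specified above, so the sketch simply keeps whatever re.split returns.

```python
import re

def str_index_split(string, separator):
    """Hypothetical model of the proposed apf:strIndexSplit.

    Splits `string` on the regex `separator` and returns (index, value)
    pairs, with the index starting from 1.
    """
    parts = re.split(separator, string)
    return list(enumerate(parts, start=1))

# Each pair corresponds to one (?index ?value) binding.
pairs = str_index_split("c1\nc2\nc3", r"\n")
print(pairs)
```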
Then we could hack the problem with something like this:
{noformat}
construct { # get first multiValue field
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
}

construct { # get second multiValue field
 ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
}

construct { # get third multiValue field
 ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
}

construct { # make JV node
 ?JV ex:id ?jvId; ex:name ?jvName.
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
}

construct { # make Company node and relation
 ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
 ?JV ex:hasParticipant ?CO
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
 bind(uri(concat("urn:tmp:",str(?ROWNUM))) as ?ROW)
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
 optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
 optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
 bind(uri(concat("company/",?coId)) as ?CO)
 bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
}
{noformat}
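
The reassembly step in the last CONSTRUCT is essentially a join on the index, with the names and industries treated as optional. A rough Python sketch of that logic follows (helper names and sample data are hypothetical, and it assumes plain newline separators rather than a regex):

```python
def split_indexed(field):
    # Map 1-based position -> value, mimicking the (?INDEX ?VALUE) pairs.
    return {i: v for i, v in enumerate(field.split("\n"), start=1)}

def companies(co_ids, co_names, co_industries):
    ids = split_indexed(co_ids)
    names = split_indexed(co_names)
    industries = split_indexed(co_industries)
    # The ids drive the join; a missing name or industry becomes None,
    # similar to an unbound variable after OPTIONAL.
    return [
        {"id": co_id, "name": names.get(i), "industry": industries.get(i)}
        for i, co_id in sorted(ids.items())
    ]

# Hypothetical row where the industries array is one entry short.
rows = companies("c1\nc2\nc3", "Acme\nBolt\nCrux", "steel\nenergy")
for row in rows:
    print(row)
```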
 

  was:
We use Tarql to convert some company CSV data to RDF.
 We had cases of multiple values in a field (eg aliases) that we handle with 
apf:strSplit.

But now we've hit another case: several multi-value fields arranged in parallel 
arrays.
 Each CSV row is a Joint Venture (?jvId, ?jvName) and there are 3 
newline-separated parallel arrays that describe the participant companies: 
?coIds, ?coNames, ?coIndustries.
 If we use several apf:strSplit in one query, that will cause a Cartesian 
product, and mix up all company ids, names, industries together.

Tarql allows multiple CONSTRUCT queries in one script, and "the triples 
generated by previous CONSTRUCT clauses can be queried in subsequent WHERE 
clauses to retrieve additional data". So my idea is to split each column in a 
separate CONSTRUCT, attach the values to temporary nodes, and reassemble them 
in a final CONSTRUCT.


 But we can't do this with apf:strSplit, since it loses the index (ordering) of 
the individual values.
 We need a new Jena ARQ function, e.g. with a signature like this, where ? 
indicates unbound and $ indicates bound:
{noformat}
(?index ?value) apf:strIndexSplit ($string $separator)
Splits $string on regex $separator and produces a number of binding pairs
where ?index is bound to a sequential number (starting from 1)
and ?value is bound to the consecutive string part that is split off.
{noformat}
Then we could hack the problem with something like this:
{noformat}
construct { # get first multiValue field
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIds "\\n")
}

construct { # get second multiValue field
 ?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coNames "\\n")
}

construct { # get third multiValue field
 ?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?VALUE]
} where {
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 (?INDEX ?VALUE) apf:strIndexSplit (?coIndustries "\\n")
}

construct { # make JV node
 ?JV ex:id ?jvId; ex:name ?jvName.
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
}

construct { # make Company node and relation
 ?CO ex:id ?coId; ex:name ?coName; ex:industry ?INDUSTRY.
 ?JV ex:hasParticipant ?CO
} where {
 bind(uri(concat("jv/",?jvId)) as ?JV)
 bind(uri(concat("uri:tmp:",str(?ROWNUM))) as ?ROW)
 ?ROW tmp:coIds [tmp:index ?INDEX; tmp:value ?coId]
 optional {?ROW tmp:coNames [tmp:index ?INDEX; tmp:value ?coName]}
 optional {?ROW tmp:coIndustries [tmp:index ?INDEX; tmp:value ?coIndustry]}
 bind(uri(concat("company/",?coId)) as ?CO)
 bind(uri(concat("industry/",?coIndustry)) as ?INDUSTRY)
}
{noformat}
 


> add function apf:strIndexSplit
> ------------------------------
>
>                 Key: JENA-1505
>                 URL: https://issues.apache.org/jira/browse/JENA-1505
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Vladimir Alexiev
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
