On 12/06/14 14:30, Claude Warren wrote:
Quick question:
Would it make sense to have an immutable flag that would tell the optimizer
(or other processes) that a dataset/model/graph is not likely to change?
More of a hint rather than a rule.
No point - the stats are assumed to be good enough until invalidated
externally. They are not (currently) dynamically maintained, or a daemon
process could sweep past and update them.
c.f. PostgreSQL ANALYZE
Andy
On Thu, Jun 12, 2014 at 2:15 PM, Rob Vesse <[email protected]> wrote:
You may be interested in the following paper -
http://www.csd.uoc.gr/~hy561/papers/storageaccess/optimization/Characteristic%20Sets.pdf
- on a technique called RDF Characteristic Sets.
It tries to solve the problem Andy alludes to: most stats-based
optimisers consider triple patterns in isolation from each other rather than
as complete units. The downside of the RDF Characteristic Sets approach
is that they are potentially very expensive to calculate and would be
awkward to maintain for mutable data sets.
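As a rough illustration of the idea only (a simplified sketch, not the paper's implementation): a characteristic set is the set of predicates that occur on a subject, and the statistics record how many subjects share each set. The function name `characteristic_sets` and the `:name` triples below are made up for the example, and the sketch leaves out the per-predicate occurrence counts the technique also keeps.

```python
from collections import Counter, defaultdict

def characteristic_sets(triples):
    """Map each distinct predicate set to the number of subjects that have it."""
    predicates_by_subject = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_by_subject[subject].add(predicate)
    # frozenset makes each predicate set hashable so it can be counted.
    return Counter(frozenset(preds) for preds in predicates_by_subject.values())

# Hypothetical data, not taken from the paper or from this thread.
triples = [
    ("S1", ":identifier", "P1"),
    ("S1", ":name", "a"),
    ("S2", ":identifier", "P2"),
    ("S2", ":name", "b"),
    ("S3", ":identifier", "P3"),
]
sets = characteristic_sets(triples)
```

Every insert or delete can move a subject from one set to another, which is why keeping these counts current on mutable data is awkward.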
Rob
On 12/06/2014 13:36, "Andy Seaborne" <[email protected]> wrote:
On 12/06/14 03:35, DongNing(董宁.阿帕比) wrote:
Thanks Andy!
For more detail on question 2:
If the triple DB is as below:
S1 :identifier P1
S2 :identifier P2
S3 :identifier P3
S4 :identifier P4
S5 :identifier P5
The count for (var :identifier TERM) is 1
The count for (var :identifier var) is 5
Is that OK?
Yes
But if the triples are like these:
S1 :identifier P1
S2 :identifier P1
S3 :identifier P1
S4 :identifier P1
S5 :identifier P1
Is the count for (var :identifier TERM) 1 or 5? I think it is 5.
5
The count for (var :identifier var) is 5.
Is that OK?
One more situation: if the triples are like these
S1 :identifier P1
S2 :identifier P1
S3 :identifier P2
S4 :identifier P2
S5 :identifier P3
What is the count for (var :identifier TERM)?
Overall points first:
* the optimizer is not trying to find the perfect answer, it's trying to
find a reasonable answer, mainly deciding between alternatives. And to
some extent its role in life is avoiding the bad as much as finding the
good!
* The stats optimizer isn't a perfect scheme (see the RDF-3X papers for
more discussion) because it only considers triples independently. The
stats are an approximation.
See also the current fixed optimizer.
(var :identifier TERM) .. maybe 2. It's not about exactness; only the
first triple gets an exact look up where you could have
(var :identifier P1) 2
(var :identifier P2) 2
(var :identifier P3) 1
It could reorder after every pattern but that might end up with the
optimizer costing more than the execution.
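One way to arrive at a figure like "maybe 2" for this data, shown as a hedged sketch: count the exact matches per concrete object, then average over the distinct objects. The averaging rule and the helper names `term_counts` and `average_estimate` are assumptions made for illustration; nothing here is taken from the TDB code.

```python
from collections import Counter

def term_counts(triples, predicate):
    """Exact count of (var predicate TERM) for each concrete TERM."""
    return Counter(o for _s, p, o in triples if p == predicate)

def average_estimate(triples, predicate):
    """A single figure for (var predicate TERM): total uses / distinct objects."""
    counts = term_counts(triples, predicate)
    return sum(counts.values()) / len(counts) if counts else 0.0

# The third data set from the question above.
triples = [
    ("S1", ":identifier", "P1"),
    ("S2", ":identifier", "P1"),
    ("S3", ":identifier", "P2"),
    ("S4", ":identifier", "P2"),
    ("S5", ":identifier", "P3"),
]
per_term = term_counts(triples, ":identifier")    # P1: 2, P2: 2, P3: 1
estimate = average_estimate(triples, ":identifier")  # 5 uses / 3 distinct objects
```

The per-term counts are the exact lookups listed above; the average is the kind of single approximate number a stats rule can carry when the concrete value is not known in advance.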
Andy
Thanks again!
Tony
-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: 12 June 2014 2:08
To: [email protected]
Subject: Re: TDB OPTIMIZER question:a puzzled of RULE language about " VAR
and TERM "
On 11/06/14 08:03, DongNing(董宁.阿帕比) wrote:
Hi all:
I am a beginner with Jena and I am studying TDB's optimizer, in particular
the statistics rules.
1. I think the difference between TERM and VAR is that VAR represents a
variable in SPARQL, while TERM only represents a possible value in the DB;
it doesn't represent a variable in SPARQL.
Is that right?
Yes - TERM means "will be bound at this point"
2. For a static graph DB (the triples are fixed and do not change),
should the count for (var :identifier TERM) and the count for
(var :identifier var) be the same?
No.
(var :identifier TERM) should be an estimate of the cardinality
when there is a specific value. (var :identifier var) would be the count of
all uses of :identifier.
If :ifp is an inverse functional property,
(?x :ifp TERM) is one.
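To make the VAR/TERM distinction concrete, here is a minimal sketch of how such rules could drive pattern ordering. It is an illustration under stated assumptions, not TDB's actual algorithm: the `stats` dictionary, the `:name` predicate, the default weight of 1000, and the functions `abstract` and `order_patterns` are all hypothetical. A slot counts as TERM if it is a constant or a variable that an earlier pattern will already have bound.

```python
def is_var(node):
    """SPARQL-style variables are written with a leading '?'."""
    return node.startswith("?")

def abstract(pattern, bound):
    """Rewrite a triple pattern into VAR/TERM form, keeping a constant predicate."""
    s, p, o = pattern

    def slot(node):
        # An unbound variable is VAR; constants and bound variables are TERM.
        return "VAR" if is_var(node) and node not in bound else "TERM"

    return (slot(s), slot(p) if is_var(p) else p, slot(o))

def order_patterns(patterns, stats, default=1000):
    """Greedily pick the pattern with the lowest estimated count next."""
    remaining = list(patterns)
    bound = set()
    ordered = []
    while remaining:
        best = min(remaining, key=lambda t: stats.get(abstract(t, bound), default))
        remaining.remove(best)
        ordered.append(best)
        # Variables in the chosen pattern are bound for everything after it.
        bound.update(node for node in best if is_var(node))
    return ordered

# Hypothetical statistics: estimated counts keyed by abstract pattern.
stats = {
    ("VAR", ":identifier", "TERM"): 1,
    ("VAR", ":identifier", "VAR"): 5,
    ("VAR", ":name", "VAR"): 100,
    ("TERM", ":name", "VAR"): 1,
}
patterns = [("?x", ":name", "?n"), ("?x", ":identifier", "P1")]
ordered = order_patterns(patterns, stats)
```

With these made-up numbers the `:identifier` pattern runs first because its object is a TERM; after that `?x` is bound, so the `:name` pattern is weighed as (TERM :name VAR) rather than (VAR :name VAR).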
3. There are a few explanations and samples on
http://jena.apache.org. Are there any other tutorials about the statistics
rules?
Only the code I'm afraid.
Andy
Thanks!
Tony.Dong