A compound index on _type+_path or _type +_path + _key does not improve things. The query time is still in the range of 120 to 150 seconds.
Andreas Am Freitag, 14. Juni 2019 15:01:30 UTC+2 schrieb Wilfried Gösgens: > > May I get back to my sugestion once more? > > Could you instead of the index on `_type` create a combined index ot > `_type`, `_path` and `_key` ? > This should copy these fields into the index, so ArangoDB doesn't have to > fetch the (big) documents. > I gues fetching and decompressing them is huge. > > Another suggestion would be to put the payload (You've got base64 encoded > binary data, right?) into a separate collection, parted of the structural > information. > > Cheers, > Willi > > On Friday, June 14, 2019 at 1:10:44 PM UTC+2, Andreas Jung wrote: >> >> All _path value are unique, we have about 20 different values for _type. >> I am not sure if I can break down the dataset into something smaller. >> The data is in general sensitive and not easy to share or anonymize. >> >> >> Am Freitag, 14. Juni 2019 13:03:59 UTC+2 schrieb Wilfried Gösgens: >>> >>> >>> Hi, >>> Can you share a set of sample documents? How well is the distribution on >>> `_type` ? Which samples are there? >>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote: >>>> >>>> Recreating the indexes after import does not make a difference. >>>> >>>> Returning doc._path for 20.000 items takes 50 ms, returning doc._path >>>> takes minutes >>>> >>>> The _path index is deduplicated, the _type index is not >>>> >>>> The only difference in the execution plans is "index only" when "RETURN >>>> doc._type". Since both _type and _path >>>> are fully indexed I would assume that the query is executed in both >>>> times based on index data. >>>> >>>> So ArangoDB will load all 100.000 objects for picking up the value of >>>> _path? The overall data is meanwhile 55 GB >>>> (about one third of the data is binary data (files and images base64 >>>> encoded). >>>> >>>> This is all no big problem for me since we perform such queries once >>>> before a migration run and it does matter taking >>>> a migration running for some hours a minutes more or less but I want to >>>> understand what is going on here (in particular >>>> this is unexpected behavior). >>>> >>>> >>>> Query String: >>>> for doc in import >>>> filter doc._type == 'Image' >>>> return doc._type >>>> >>>> Execution plan: >>>> Id NodeType Est. Comment >>>> 1 SingletonNode 1 * ROOT >>>> 7 IndexNode 2214 - FOR doc IN import /* hash index >>>> scan, index only, projections: `_type` */ >>>> 5 CalculationNode 2214 - LET #3 = doc.`_type` /* >>>> attribute expression */ /* collections used: doc : import */ >>>> 6 ReturnNode 2214 - RETURN #3 >>>> >>>> Indexes used: >>>> By Type Collection Unique Sparse Selectivity Fields >>>> Ranges >>>> 7 hash import false false 0.05 % [ `_type` ] >>>> (doc.`_type` == "Image") >>>> >>>> Optimization rules applied: >>>> Id RuleName >>>> 1 move-calculations-up >>>> 2 move-filters-up >>>> 3 move-calculations-up-2 >>>> 4 move-filters-up-2 >>>> 5 use-indexes >>>> 6 remove-filter-covered-by-index >>>> 7 remove-unnecessary-calculations-2 >>>> 8 reduce-extraction-to-projection >>>> >>>> >>>> >>>> Query String: >>>> for doc in import >>>> filter doc._type == 'Image' >>>> return doc._path >>>> >>>> Execution plan: >>>> Id NodeType Est. Comment >>>> 1 SingletonNode 1 * ROOT >>>> 7 IndexNode 2214 - FOR doc IN import /* hash index >>>> scan, projections: `_path` */ >>>> 5 CalculationNode 2214 - LET #3 = doc.`_path` /* >>>> attribute expression */ /* collections used: doc : import */ >>>> 6 ReturnNode 2214 - RETURN #3 >>>> >>>> Indexes used: >>>> By Type Collection Unique Sparse Selectivity Fields >>>> Ranges >>>> 7 hash import false false 0.05 % [ `_type` ] >>>> (doc.`_type` == "Image") >>>> >>>> Optimization rules applied: >>>> Id RuleName >>>> 1 move-calculations-up >>>> 2 move-filters-up >>>> 3 move-calculations-up-2 >>>> 4 move-filters-up-2 >>>> 5 use-indexes >>>> 6 remove-filter-covered-by-index >>>> 7 remove-unnecessary-calculations-2 >>>> 8 reduce-extraction-to-projection >>>> >>>> >>>> >>>> >>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote: >>>>> >>>>> Using RocksDB (default installation). >>>>> >>>>> I create a new collection for every import of the data including the >>>>> indexes. >>>>> >>>>> Unfortunately I don't have the key names in my hands. They are coming >>>>> from a JSON dump of a CMS. >>>>> >>>>> Am Freitag, 14. Juni 2019 09:50:41 UTC+2 schrieb Wilfried Gösgens: >>>>>> >>>>>> Hi, >>>>>> afair you're using rocksdb? >>>>>> >>>>>> can you try to re-create that index to be on `_type`, `_path`, `_key` >>>>>> for better using of projections? >>>>>> >>>>>> Please note that you shouldn't use fieldnames starting with `_` since >>>>>> they're defined as system specific fields in arangodb. >>>>>> >>>>>> Cheers, >>>>>> Willi >>>>>> >>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote: >>>>>>> >>>>>>> _key is a UUID4 >>>>>>> _path is standard filesystem path not longer than 100 chars each >>>>>>> >>>>>>> That can not be the problem. >>>>>>> >>>>>>> Am Freitag, 14. Juni 2019 09:36:17 UTC+2 schrieb James >>>>>>> Courtier-Dutton: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> What is the average size of the returned data? It could just be the >>>>>>>> time it takes to serialise the data being returned >>>>>>>> >>>>>>>> James >>>>>>>> >>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hi there, >>>>>>>>> >>>>>>>>> this query >>>>>>>>> >>>>>>>>> for doc in import >>>>>>>>> filter doc._type == 'Image' >>>>>>>>> return {path: doc._path, key: doc._key} >>>>>>>>> >>>>>>>>> takes about 45 seconds on decent hardware with an import >>>>>>>>> collection of about 100.000 items with about 21.000 of _type = >>>>>>>>> 'Image'. >>>>>>>>> There is an index of _type. Using PyArango as client...I really >>>>>>>>> wander why this query is running so slow?! >>>>>>>>> >>>>>>>>> Running ArangoDB 3.4.3 >>>>>>>>> >>>>>>>>> Profile >>>>>>>>> >>>>>>>>> Query String: >>>>>>>>> for doc in import >>>>>>>>> filter doc._type == 'Image' >>>>>>>>> return {path: doc._path, key: doc._key} >>>>>>>>> >>>>>>>>> Execution plan: >>>>>>>>> Id NodeType Calls Items Runtime [s] Comment >>>>>>>>> 1 SingletonNode 1 1 0.00000 * ROOT >>>>>>>>> 7 IndexNode 21 20617 32.73956 - FOR doc >>>>>>>>> IN import /* hash index scan, projections: `_key`, `_path` */ >>>>>>>>> 5 CalculationNode 21 20617 0.04354 - LET #3 >>>>>>>>> = { "path" : doc.`_path`, "key" : doc.`_key` } /* simple expression >>>>>>>>> */ >>>>>>>>> /* collections used: doc : import */ >>>>>>>>> 6 ReturnNode 21 20617 0.00016 - RETURN >>>>>>>>> #3 >>>>>>>>> >>>>>>>>> Indexes used: >>>>>>>>> By Type Collection Unique Sparse Selectivity Fields >>>>>>>>> Ranges >>>>>>>>> 7 hash import false false 0.05 % [ >>>>>>>>> `_type` ] (doc.`_type` == "Image") >>>>>>>>> >>>>>>>>> Optimization rules applied: >>>>>>>>> Id RuleName >>>>>>>>> 1 move-calculations-up >>>>>>>>> 2 move-filters-up >>>>>>>>> 3 move-calculations-up-2 >>>>>>>>> 4 move-filters-up-2 >>>>>>>>> 5 use-indexes >>>>>>>>> 6 remove-filter-covered-by-index >>>>>>>>> 7 remove-unnecessary-calculations-2 >>>>>>>>> 8 reduce-extraction-to-projection >>>>>>>>> >>>>>>>>> Query Statistics: >>>>>>>>> Writes Exec Writes Ign Scan Full Scan Index Filtered >>>>>>>>> Exec Time [s] >>>>>>>>> 0 0 0 20617 0 >>>>>>>>> 32.78928 >>>>>>>>> >>>>>>>>> Query Profile: >>>>>>>>> Query Stage Duration [s] >>>>>>>>> initializing 0.00001 >>>>>>>>> parsing 0.00010 >>>>>>>>> optimizing ast 0.00001 >>>>>>>>> loading collections 0.00002 >>>>>>>>> instantiating plan 0.00005 >>>>>>>>> optimizing plan 0.00032 >>>>>>>>> executing 32.78841 >>>>>>>>> finalizing 0.00032 >>>>>>>>> >>>>>>>>> -- >>>>>>>>> You received this message because you are subscribed to the Google >>>>>>>>> Groups "ArangoDB" group. >>>>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>>>> send an email to [email protected]. >>>>>>>>> To view this discussion on the web visit >>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com >>>>>>>>> >>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>>>> . >>>>>>>>> For more options, visit https://groups.google.com/d/optout. >>>>>>>>> >>>>>>>> -- You received this message because you are subscribed to the Google Groups "ArangoDB" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/arangodb/24c7cc65-765e-402e-a779-f04544b444c0%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
