Re: [arangodb-google] Simple query, pretty slow

Jan Stücke Fri, 14 Jun 2019 06:01:03 -0700

Hi,

If I read your situation correctly, you store the metadata and the
binary data in one document, correct?

In that case, ArangoDB might be accessing the whole document, which means
the whole binary data has to be read as well. ArangoDB is not
optimized for large blob storage.
If your use case allows it, you could separate metadata and binary data
into two collections. That way, you can have fast queries on the metadata and
only look up the relevant binary data when necessary. The best solution is to use
a dedicated file system for the blob storage.
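
A minimal sketch of that split, in Python since the thread mentions PyArango
as the client. It only prepares the two documents and does not talk to the
database. The attribute name `data` for the base64 payload, the function name
`split_document`, and the collection names in the note below are assumptions
made for illustration; they do not come from the original data set.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Document = Dict[str, Any]

# Assumption: the base64-encoded payload lives in an attribute called "data".
BLOB_FIELDS = ("data",)


def split_document(
    doc: Document, blob_fields: Iterable[str] = BLOB_FIELDS
) -> Tuple[Document, Optional[Document]]:
    """Split one document into a metadata part and a blob part.

    Both parts carry the same _key, so the blob can be looked up later
    by the key of its metadata document.
    """
    blob_names = set(blob_fields)
    # Everything except the large attributes stays in the metadata document.
    meta = {k: v for k, v in doc.items() if k not in blob_names}
    # Only the large attributes go into the blob document.
    payload = {k: v for k, v in doc.items() if k in blob_names}
    blob = {"_key": doc["_key"], **payload} if payload else None
    return meta, blob
```

Metadata documents would then go into one collection (say `import_meta`) and
blob documents into another (say `import_blobs`). A query filtering on `_type`
would only touch the small documents, and a blob would be fetched by `_key`
when it is actually needed.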

Hope that helps

On Fri, 14 June 2019 at 13:10, Andreas Jung <
[email protected]> wrote:

> All _path values are unique, we have about 20 different values for _type.
> I am not sure if I can break down the dataset into something smaller.
> The data is in general sensitive and not easy to share or anonymize.
>
>
> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>
>>
>> Hi,
>> Can you share a set of sample documents? How well distributed are the
>> values of `_type`? Which values are there?
>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>
>>> Recreating the indexes after import does not make a difference.
>>>
>>> Returning doc._type for 20.000 items takes 50 ms, returning doc._path
>>> takes minutes.
>>>
>>> The _path index is deduplicated, the _type index is not
>>>
>>> The only difference in the execution plans is "index only" when "RETURN
>>> doc._type". Since both _type and _path
>>> are fully indexed, I would assume that the query is executed in both
>>> cases based on index data.
>>>
>>> So ArangoDB will load all 100.000 objects to pick up the value of
>>> _path? The overall data has meanwhile grown to 55 GB
>>> (about one third of the data is binary data: files and images, base64
>>> encoded).
>>>
>>> None of this is a big problem for me, since we perform such queries once
>>> before a migration run, and it does not matter whether a migration that
>>> runs for some hours takes a few minutes more or less. But I want to
>>> understand what is going on here (in particular because
>>> this is unexpected behavior).
>>>
>>>
>>> Query String:
>>>  for doc in import
>>>  filter doc._type == 'Image'
>>>  return doc._type
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>> Query String:
>>>  for doc in import
>>>  filter doc._type == 'Image'
>>>  return doc._path
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>>
>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>
>>>> Using RocksDB (default installation).
>>>>
>>>> I create a new collection for every import of the data including the
>>>> indexes.
>>>>
>>>> Unfortunately, the key names are out of my hands. They come
>>>> from a JSON dump of a CMS.
>>>>
>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>
>>>>> Hi,
>>>>> AFAIR you're using RocksDB?
>>>>>
>>>>> Can you try to re-create that index on `_type`, `_path`, `_key`
>>>>> to make better use of projections?
>>>>>
>>>>> Please note that you shouldn't use field names starting with `_`, since
>>>>> they're reserved for system attributes in ArangoDB.
>>>>>
>>>>> Cheers,
>>>>> Willi
>>>>>
>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> _key is a UUID4.
>>>>>> _path is a standard filesystem path, no longer than 100 characters.
>>>>>>
>>>>>> That cannot be the problem.
>>>>>>
>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James
>>>>>> Courtier-Dutton wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What is the average size of the returned data? It could just be the
>>>>>>> time it takes to serialise the data being returned.
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> this query
>>>>>>>>
>>>>>>>>  for doc in import
>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> takes about 45 seconds on decent hardware with an import collection
>>>>>>>> of about 100.000 items, about 21.000 of them with _type == 'Image'.
>>>>>>>> There is an index on _type. Using PyArango as client... I really
>>>>>>>> wonder why this query is running so slowly?!
>>>>>>>>
>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>
>>>>>>>> Profile
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>>  for doc in import
>>>>>>>>  filter doc._type == 'Image'
>>>>>>>>  return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>>  Id   RuleName
>>>>>>>>   1   move-calculations-up
>>>>>>>>   2   move-filters-up
>>>>>>>>   3   move-calculations-up-2
>>>>>>>>   4   move-filters-up-2
>>>>>>>>   5   use-indexes
>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>
>>>>>>>> Query Statistics:
>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>>>>>>            0            0           0        20617          0        32.78928
>>>>>>>>
>>>>>>>> Query Profile:
>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>  initializing               0.00001
>>>>>>>>  parsing                    0.00010
>>>>>>>>  optimizing ast             0.00001
>>>>>>>>  loading collections        0.00002
>>>>>>>>  instantiating plan         0.00005
>>>>>>>>  optimizing plan            0.00032
>>>>>>>>  executing                 32.78841
>>>>>>>>  finalizing                 0.00032
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "ArangoDB" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>


-- 

*Jan Stücke*
Head of Communications

[email protected] | +49 (0)221 / 2722999-60



