Hi, if I read your situation correctly, you store the metadata and the binary data in one document, correct?
In that case, ArangoDB might be accessing the whole document, which means the binary data has to be read as well. ArangoDB is not optimized for large blob storage. If your use case allows it, you could separate metadata and binary data into two collections. That way, you can run fast queries on the metadata and only look up the relevant binary data when necessary. The best solution is to use a dedicated file system for the blob storage.

Hope that helps

On Fri, Jun 14, 2019 at 1:10 PM Andreas Jung <[email protected]> wrote:

> All _path values are unique, we have about 20 different values for _type.
> I am not sure if I can break down the dataset into something smaller.
> The data is in general sensitive and not easy to share or anonymize.
>
>
> On Friday, June 14, 2019 at 1:03:59 PM UTC+2, Wilfried Gösgens wrote:
>>
>>
>> Hi,
>> Can you share a set of sample documents? How well distributed are the
>> values of `_type`? Which samples are there?
>> On Friday, June 14, 2019 at 11:22:51 AM UTC+2, Andreas Jung wrote:
>>>
>>> Recreating the indexes after import does not make a difference.
>>>
>>> Returning doc._type for 20.000 items takes 50 ms, returning doc._path
>>> takes minutes.
>>>
>>> The _path index is deduplicated, the _type index is not.
>>>
>>> The only difference in the execution plans is "index only" when "RETURN
>>> doc._type". Since both _type and _path are fully indexed, I would assume
>>> that the query is executed based on index data in both cases.
>>>
>>> So ArangoDB will load all 100.000 objects for picking up the value of
>>> _path? The overall data is by now 55 GB (about one third of it is
>>> binary data: files and images, base64 encoded).
>>>
>>> None of this is a big problem for me, since we perform such queries once
>>> before a migration run, and it does not matter whether a migration that
>>> runs for some hours takes a few minutes more or less. But I want to
>>> understand what is going on here (in particular because this is
>>> unexpected behavior).
>>>
>>>
>>> Query String:
>>>  for doc in import
>>>    filter doc._type == 'Image'
>>>    return doc._type
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, index only, projections: `_type` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_type`   /* attribute expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>> Query String:
>>>  for doc in import
>>>    filter doc._type == 'Image'
>>>    return doc._path
>>>
>>> Execution plan:
>>>  Id   NodeType          Est.   Comment
>>>   1   SingletonNode        1   * ROOT
>>>   7   IndexNode         2214     - FOR doc IN import   /* hash index scan, projections: `_path` */
>>>   5   CalculationNode   2214       - LET #3 = doc.`_path`   /* attribute expression */   /* collections used: doc : import */
>>>   6   ReturnNode        2214       - RETURN #3
>>>
>>> Indexes used:
>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>
>>> Optimization rules applied:
>>>  Id   RuleName
>>>   1   move-calculations-up
>>>   2   move-filters-up
>>>   3   move-calculations-up-2
>>>   4   move-filters-up-2
>>>   5   use-indexes
>>>   6   remove-filter-covered-by-index
>>>   7   remove-unnecessary-calculations-2
>>>   8   reduce-extraction-to-projection
>>>
>>>
>>>
>>>
>>> On Friday, June 14, 2019 at 9:54:10 AM UTC+2, Andreas Jung wrote:
>>>>
>>>> Using RocksDB (default installation).
>>>>
>>>> I create a new collection for every import of the data including the
>>>> indexes.
>>>>
>>>> Unfortunately I don't have the key names in my hands. They are coming
>>>> from a JSON dump of a CMS.
>>>>
>>>> On Friday, June 14, 2019 at 9:50:41 AM UTC+2, Wilfried Gösgens wrote:
>>>>>
>>>>> Hi,
>>>>> afair you're using rocksdb?
>>>>>
>>>>> can you try to re-create that index to be on `_type`, `_path`, `_key`
>>>>> to make better use of projections?
>>>>>
>>>>> Please note that you shouldn't use field names starting with `_` since
>>>>> they're defined as system specific fields in arangodb.
>>>>>
>>>>> Cheers,
>>>>> Willi
>>>>>
>>>>> On Friday, June 14, 2019 at 9:41:24 AM UTC+2, Andreas Jung wrote:
>>>>>>
>>>>>> _key is a UUID4
>>>>>> _path is a standard filesystem path, not longer than 100 chars each
>>>>>>
>>>>>> That cannot be the problem.
>>>>>>
>>>>>> On Friday, June 14, 2019 at 9:36:17 AM UTC+2, James Courtier-Dutton wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What is the average size of the returned data? It could just be the
>>>>>>> time it takes to serialise the data being returned
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On Fri, 14 Jun 2019, 05:45 'Andreas Jung' via ArangoDB, <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> this query
>>>>>>>>
>>>>>>>>  for doc in import
>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> takes about 45 seconds on decent hardware with an import collection
>>>>>>>> of about 100.000 items with about 21.000 of _type = 'Image'.
>>>>>>>> There is an index on _type. Using PyArango as client... I really
>>>>>>>> wonder why this query is running so slow?!
>>>>>>>>
>>>>>>>> Running ArangoDB 3.4.3
>>>>>>>>
>>>>>>>> Profile
>>>>>>>>
>>>>>>>> Query String:
>>>>>>>>  for doc in import
>>>>>>>>    filter doc._type == 'Image'
>>>>>>>>    return {path: doc._path, key: doc._key}
>>>>>>>>
>>>>>>>> Execution plan:
>>>>>>>>  Id   NodeType          Calls   Items   Runtime [s]   Comment
>>>>>>>>   1   SingletonNode         1       1       0.00000   * ROOT
>>>>>>>>   7   IndexNode            21   20617      32.73956     - FOR doc IN import   /* hash index scan, projections: `_key`, `_path` */
>>>>>>>>   5   CalculationNode      21   20617       0.04354       - LET #3 = { "path" : doc.`_path`, "key" : doc.`_key` }   /* simple expression */   /* collections used: doc : import */
>>>>>>>>   6   ReturnNode           21   20617       0.00016       - RETURN #3
>>>>>>>>
>>>>>>>> Indexes used:
>>>>>>>>  By   Type   Collection   Unique   Sparse   Selectivity   Fields        Ranges
>>>>>>>>   7   hash   import       false    false         0.05 %   [ `_type` ]   (doc.`_type` == "Image")
>>>>>>>>
>>>>>>>> Optimization rules applied:
>>>>>>>>  Id   RuleName
>>>>>>>>   1   move-calculations-up
>>>>>>>>   2   move-filters-up
>>>>>>>>   3   move-calculations-up-2
>>>>>>>>   4   move-filters-up-2
>>>>>>>>   5   use-indexes
>>>>>>>>   6   remove-filter-covered-by-index
>>>>>>>>   7   remove-unnecessary-calculations-2
>>>>>>>>   8   reduce-extraction-to-projection
>>>>>>>>
>>>>>>>> Query Statistics:
>>>>>>>>  Writes Exec   Writes Ign   Scan Full   Scan Index   Filtered   Exec Time [s]
>>>>>>>>            0            0           0        20617          0        32.78928
>>>>>>>>
>>>>>>>> Query Profile:
>>>>>>>>  Query Stage           Duration [s]
>>>>>>>>  initializing               0.00001
>>>>>>>>  parsing                    0.00010
>>>>>>>>  optimizing ast             0.00001
>>>>>>>>  loading collections        0.00002
>>>>>>>>  instantiating plan         0.00005
>>>>>>>>  optimizing plan            0.00032
>>>>>>>>  executing                 32.78841
>>>>>>>>  finalizing                 0.00032
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "ArangoDB" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/arangodb/6c2de54c-3936-4aa5-8b6a-2dae3e5afcf7%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>

--
*Jan Stücke*
Head of Communications
[email protected] | +49 (0)221 / 2722999-60
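
To illustrate the two-collection split suggested at the top of this thread, here is a minimal Python sketch of how each imported document could be divided before it is stored. The blob field names (`data`, `image`) are assumptions, since the thread does not say which attributes hold the base64 payload, and the helper `split_document` is hypothetical rather than part of PyArango or ArangoDB.

```python
from typing import Any, Dict, Iterable, Tuple

# Assumed names of the attributes that carry base64-encoded binary data.
# The thread does not name them; adjust to the actual JSON dump.
BLOB_FIELDS = ("data", "image")


def split_document(
    doc: Dict[str, Any],
    blob_fields: Iterable[str] = BLOB_FIELDS,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one document into a small metadata part and a blob part.

    Both parts keep the same `_key`, so the blob can be looked up by key
    after a metadata query has identified the documents of interest.
    """
    blob_names = set(blob_fields)
    # Everything that is not a blob attribute stays in the metadata document.
    meta = {k: v for k, v in doc.items() if k not in blob_names}
    # Only the blob attributes go into the second document.
    blob = {k: v for k, v in doc.items() if k in blob_names}
    blob["_key"] = doc["_key"]
    return meta, blob


if __name__ == "__main__":
    sample = {
        "_key": "k1",
        "_type": "Image",
        "_path": "/site/images/logo.png",
        "data": "aGVsbG8=",  # stand-in for a large base64 payload
    }
    meta, blob = split_document(sample)
    print(meta)
    print(blob)
```

The metadata part would go into one collection and the blob part into another. The filter on `_type` from the original query could then run against the small metadata documents only, and a blob would be fetched by `_key` only when it is actually needed.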
