If you're interested in testing a big dataset, you can try this OSM dataset, which comes in a simple CSV format:
https://drive.google.com/open?id=0B1jY75xGiy7eNjJuRy1KWjRieVU
It contains 2.7 billion records with very few duplicates. This is what the dataset looks like:
http://spatialhadoop.cs.umn.edu/datasets/osm2/all_nodes.pyramid/

Thanks
Ahmed

On Thu, Sep 15, 2016 at 10:13 PM Khurram Faraaz <[email protected]> wrote:

> @Pouria here is Uber trip data
>
> https://github.com/fivethirtyeight/uber-tlc-foil-response
>
> On Sep 16, 2016 1:21 AM, "Chen Li" <[email protected]> wrote:
>
>> @Wail: as a use case related to selectivity, our current Cloudberry prototype doesn't benefit from R-tree when the user is analyzing the data for the entire US. But we expect to have R-tree benefits when a user zooms into a small region.
>>
>> On Thu, Sep 15, 2016 at 8:25 AM, Wail Alkowaileet <[email protected]> wrote:
>>
>>> Hi Ahmed and Mike,
>>>
>>> @Ahmed
>>> I actually did a small experiment where I loaded about 1/5 of the data (so I can index it) and seems that the R-Tree was really useful for querying small regions or neighborhoods.
>>> I also tried the B-Tree and it was slower than a full scan.
>>>
>>> @Mike
>>> Unfortunately, I cannot still even after anonymization :-)
>>>
>>> On Wed, Sep 14, 2016 at 11:29 PM, Mike Carey <[email protected]> wrote:
>>>
>>>> Interesting point, so to speak. @Wail, any chance you could post a Google maps screenshot showing a visualization of the points in this dataset on the underlying geographic region? (If the dataset is shareable in that anonymized form?) I would think an R-tree would still be good for small-region geo queries - possibly shrinking the candidate object set by a factor of 10,000 - so still useful - and we also do index-AND-ing now, so we would also combine that shrinkage by other index-provided shrinkage on any other index-amenable predicates. I think the queries are still spatial in nature, and the only AsterixDB choices for that are R-tree. (We did experiments with things like Hilbert B-trees, but the results led to the conclusion that the code base only needs R-trees for spatial data for the forseeable future - they just work too well and in a no-tuning-required fashion.... :-))
>>>>
>>>> On 9/14/16 12:49 PM, Ahmed Eldawy wrote:
>>>>
>>>>> Looks like an interesting case. Just a small question. Are you sure a spatial index is the right one to use here? The spatial attribute looks more like a categorization and a hash or B-tree index could be more suitable. As far as I know, the spatial index in AsterixDB is a secondary R-tree index which, like any other secondary index, is only good for retrieving a small number of records. For this dataset, it seems that any small range would still return a huge number of records.
>>>>>
>>>>> It is still interesting to further investigate and fix the sort issue but I mentioned the usage issue for a different perspective.
>>>>>
>>>>> Thanks
>>>>> Ahmed
>>>>>
>>>>> On Wed, Sep 14, 2016 at 10:30 AM Mike Carey <[email protected]> wrote:
>>>>>
>>>>>> ☺!
>>>>>>
>>>>>> On Sep 14, 2016 1:11 AM, "Wail Alkowaileet" <[email protected]> wrote:
>>>>>>
>>>>>>> To be exact
>>>>>>> I have 2,255,091,590 records and 10,391 points :-)
>>>>>>>
>>>>>>> On Wed, Sep 14, 2016 at 10:46 AM, Mike Carey <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thx! I knew I'd meant to "activate" the thought somehow, but couldn't remember having done it for sure. Oops! Scattered from VLDB, I guess...!
>>>>>>>>
>>>>>>>> On 9/13/16 9:58 PM, Taewoo Kim wrote:
>>>>>>>>
>>>>>>>>> @Mike: You filed an issue - https://issues.apache.org/jira/browse/ASTERIXDB-1639. :-)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Taewoo
>>>>>>>>>
>>>>>>>>> On Tue, Sep 13, 2016 at 9:28 PM, Mike Carey <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I can't remember (slight jetlag? :-)) if I shared back to this list one theory that came up in India when Wail and I talked F2F - his data has a lot of duplicate points, so maybe something goes awry in that case. I wonder if we've sufficiently tested that case? (E.g., what if there are gazillions of records originating from a small handful of points?)
>>>>>>>>>>
>>>>>>>>>> On 8/26/16 9:55 AM, Taewoo Kim wrote:
>>>>>>>>>>
>>>>>>>>>>> Based on a rough calculation, per partition, each point field takes 3.6GB (16 bytes * 2887453794 records / 12 partition). To sort 3.6GB, we are generating 625 files (96MB or 128MB each) = 157GB. Since Wail mentioned that there was no issue when creating a B+ tree index, we need to check what SORT process is required by R-Tree index.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Taewoo
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 26, 2016 at 7:52 AM, Jianfeng Jia <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If all of the file names start with “ExternalSortRunGenerator”, then they are the first round files which can not be GCed.
>>>>>>>>>>>> Could you provide the query plan as well?
>>>>>>>>>>>>
>>>>>>>>>>>> On Aug 24, 2016, at 10:02 PM, Wail Alkowaileet <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ian and Pouria,
>>>>>>>>>>>>>
>>>>>>>>>>>>> The name of the files along with the sizes (there were 625 one of those before crashing):
>>>>>>>>>>>>>
>>>>>>>>>>>>> size name
>>>>>>>>>>>>> 96MB ExternalSortRunGenerator8917133039835449370.waf
>>>>>>>>>>>>> 128MB ExternalSortRunGenerator8948724728025392343.waf
>>>>>>>>>>>>>
>>>>>>>>>>>>> no files were generated beyond runs.
>>>>>>>>>>>>> compiler.sortmemory = 64MB
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is the full logs
>>>>>>>>>>>>> <https://www.dropbox.com/s/k2qbo3wybc8mnnk/log_Thu_Aug_25_07%3A34%3A52_AST_2016.zip?dl=0>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 9:29 PM, Pouria Pirzadeh <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We previously had issues with huge spilled sort temp files when creating inverted index for fuzzy queries, but NOT R-Trees.
>>>>>>>>>>>>>> I also recall that Yingyi fixed the issue of delaying clean-up for intermediate temp files until the end of the query execution.
>>>>>>>>>>>>>> If you can share names of a couple of temp files (and their sizes along with the sort memory setting you have in asterix-configuration.xml) we may be able to have a better guess as if the sort is really going into a two-level merge or not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pouria
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 11:09 AM, Ian Maxon <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think that execption ("No space left on device") is just casted from the native IOException. Therefore I would be inclined to believe it's genuinely out of space. I suppose the question is why the external sort is so huge. What is the query plan? Maybe that will shed light on a possible cause.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 9:59 AM, Wail Alkowaileet <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was monitoring Inodes ... it didn't go beyond 1%.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 7:58 PM, Wail Alkowaileet <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Chris and Mike,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Actually I was monitoring it to see what's going on:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - The size of each partition is about 40GB (80GB in total per iodevice).
>>>>>>>>>>>>>>>>> - The runs took 157GB per iodevice (about 2x of the dataset size). Each run takes either of 128MB or 96MB of storage.
>>>>>>>>>>>>>>>>> - At a certain time, there were 522 runs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I even tried to create a BTree Index to see if that happens as well. I created two BTree indexes one for the *location* and one for the *caller* and they were created successfully. The sizes of the runs didn't take anyway near that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Logs are attached.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 7:19 PM, Mike Carey <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think we might have "file GC issues" - I vaguely remember that we don't (or at least didn't once upon a time) proactively remove unnecessary run files - removing all of them at end-of-job instead of at the end of the execution phase that uses their contents. We may also have an "Amdahl problem" right now with our sort since we serialize phase two of parallel sorts - though this is not a query, it's index build, so that shouldn't be it. It would be interesting to put a df/sleep script on each of the nodes when this is happening - actually a script that monitors the temp file directory - and watch the lifecycle happen and the sizes change....
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 8/23/16 2:06 AM, Chris Hillery wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> When you get the "disk full" warning, do a quick "df -i" on the device - possibly you've run out of inodes even if the space isn't all used up. It's unlikely because I don't think AsterixDB creates a bunch of small files, but worth checking.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If that's not it, then can you share the full exception and stack trace?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ceej
>>>>>>>>>>>>>>>>>>> aka Chris Hillery
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 1:59 AM, Wail Alkowaileet <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I just cleared the hard drives to get 80% free space. I still get the same issue.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The data contains:
>>>>>>>>>>>>>>>>>>>> 1- 2887453794 records.
>>>>>>>>>>>>>>>>>>>> 2- Schema:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> create type CDRType as {
>>>>>>>>>>>>>>>>>>>> id:uuid,
>>>>>>>>>>>>>>>>>>>> 'date':string,
>>>>>>>>>>>>>>>>>>>> 'time':string,
>>>>>>>>>>>>>>>>>>>> 'duration':int64,
>>>>>>>>>>>>>>>>>>>> 'caller':int64,
>>>>>>>>>>>>>>>>>>>> 'callee':int64,
>>>>>>>>>>>>>>>>>>>> location:point?
>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Aug 23, 2016 at 9:06 AM, Wail Alkowaileet <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Dears,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I have a dataset of size 290GB loaded in a 3 NCs each of which has 2x500GB SSD.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Each of NC has two IODevices (partitions) in each hard drive (i.e the total is 4 iodevices per NC). After loading the data, each Asterix partition occupied 31GB.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The cluster has about 50% free space in each hard drive (approximately about 250GB free space in each hard drive). However, when I tried to create an index of type RTree, I got an exception that no space left in the hard drive during the External Sort phase.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Is that normal ?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> *Regards,*
>>>>>>>>>>>>>>>>>>>>> Wail Alkowaileet
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> *Regards,*
>>>>>>>>>>>>>>>>>>>> Wail Alkowaileet
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> *Regards,*
>>>>>>>>>>>>>>>>> Wail Alkowaileet
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> *Regards,*
>>>>>>>>>>>>>>>> Wail Alkowaileet
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> *Regards,*
>>>>>>>>>>>>> Wail Alkowaileet
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> Jianfeng Jia
>>>>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>>>>> University of California, Irvine
>>>>>>>
>>>>>>> --
>>>>>>> *Regards,*
>>>>>>> Wail Alkowaileet
>>>
>>> --
>>> *Regards,*
>>> Wail Alkowaileet
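
For anyone who wants to re-check the rough numbers quoted in the thread (16 bytes per point, 2887453794 records, 12 partitions, 625 run files of 96MB or 128MB each), here is a minimal Python sketch. It assumes binary units (1 GB = 1024^3 bytes), which the thread does not state, and the function names are made up for illustration only.

```python
# Rough re-check of the sort-size figures quoted in the thread.
# Assumption: "GB"/"MB" are binary units (1024-based); the thread does not say.
GIB = 1024 ** 3
MIB = 1024 ** 2


def point_bytes_per_partition(records=2_887_453_794, partitions=12, point_bytes=16):
    """Bytes of point data each partition has to sort (hypothetical helper)."""
    return records * point_bytes // partitions


def run_file_bounds(num_files, small_mib=96, large_mib=128):
    """Lower/upper bound in bytes if every run file is 96MB or 128MB."""
    return num_files * small_mib * MIB, num_files * large_mib * MIB


per_partition = point_bytes_per_partition()
low, high = run_file_bounds(625)

print(f"point data per partition: {per_partition / GIB:.2f} GB")
print(f"625 run files occupy between {low / GIB:.2f} and {high / GIB:.2f} GB")
```

The first figure agrees with the 3.6GB estimate. Under these assumptions, 625 run files of the reported sizes add up to less than 157GB, so how the file count and the 157GB total relate (per partition versus per iodevice, for example) is not something the thread spells out.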

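
The "df/sleep script that monitors the temp file directory" suggested in the thread could look roughly like the Python sketch below. The directory path is something you would have to supply for your own iodevice layout, the function names are hypothetical, and the only detail taken from the thread is the "ExternalSortRunGenerator" file-name prefix.

```python
import os
import shutil
import time


def run_file_usage(path, prefix="ExternalSortRunGenerator"):
    """Return (file_count, total_bytes) for sort run files directly under path."""
    count, total = 0, 0
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            try:
                if entry.is_file():
                    total += entry.stat().st_size
                    count += 1
            except FileNotFoundError:
                # The run file was deleted between listing and stat; skip it.
                continue
    return count, total


def watch(path, interval_s=10.0, samples=3):
    """Print run-file count, their total size, and free disk space a few times."""
    for i in range(samples):
        count, total = run_file_usage(path)
        free = shutil.disk_usage(path).free
        print(f"{time.strftime('%H:%M:%S')} runs={count} "
              f"run_bytes={total} free_bytes={free}")
        if i + 1 < samples:
            time.sleep(interval_s)
```

Calling `watch("/path/to/iodevice/temp", interval_s=10, samples=360)` on each node while the index build runs would log the run files for about an hour; the path here is a placeholder, not an actual AsterixDB default.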