Alexandr,

This has been tested many times already by our users and the answer is
simple - it depends :) Every approach has its pros and cons, and you never
know which one will be better for a particular use case, database, data
model, hardware, etc.

Having said that, you will never find the best way to load the data,
because it just doesn't exist. What I propose is just to make the API more
generic and give users even more control than they have now.
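
To make the per-partition preloading discussed below a bit more concrete,
here is a rough sketch in plain Java. It deliberately uses no Ignite APIs:
the class and method names, the modulo key-to-partition mapping, and the
round-robin partition-to-node assignment are all assumptions made for
illustration. Only the PERSON table, the PART column, and the default of
1024 partitions come from this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: counts the queries a per-partition preload would issue. */
final class PartitionPreloadSketch {
    /** Default partition count mentioned in the thread. */
    static final int DEFAULT_PARTS = 1024;

    /** Hypothetical stand-in for an affinity function (simple modulo). */
    static int partition(long id, int parts) {
        return (int) Math.floorMod(id, (long) parts);
    }

    /** Partitions owned by one node under an assumed round-robin assignment. */
    static List<Integer> localPartitions(int nodeIdx, int nodes, int parts) {
        List<Integer> res = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            if (p % nodes == nodeIdx)
                res.add(p);
        }
        return res;
    }

    /** One SQL statement per local partition, filtering on the PART column. */
    static List<String> preloadQueries(List<Integer> localParts) {
        List<String> sql = new ArrayList<>();
        for (int p : localParts)
            sql.add("SELECT ID, NAME, SURNAME, AGE, EMPL_DATE FROM PERSON WHERE PART = " + p);
        return sql;
    }

    /** Number of local partitions holding no rows, i.e. queries returning nothing. */
    static int wastedRoundTrips(long[] ids, List<Integer> localParts, int parts) {
        Set<Integer> nonEmpty = new HashSet<>();
        for (long id : ids)
            nonEmpty.add(partition(id, parts));
        int wasted = 0;
        for (int p : localParts) {
            if (!nonEmpty.contains(p))
                wasted++;
        }
        return wasted;
    }
}
```

Under these assumptions, each node in a 4-node cluster issues 256 queries and
the cluster as a whole issues 1024, whatever the node count. With a small or
skewed table, most of those queries come back empty. Whether this is better
or worse than a DataStreamer-based load still depends on the database and the
data, which is exactly the point above.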

-Val

On Fri, Nov 18, 2016 at 6:53 AM, Alexandr Kuramshin <[email protected]>
wrote:

> Dmitriy,
>
> I am not fully confident that partition ID is the best approach in all
> cases. Even if we have full access to the database structure, there are
> other problems.
>
> Assume we have a table PERSON (ID NUMBER, NAME VARCHAR, SURNAME VARCHAR,
> AGE NUMBER, EMPL_DATE DATE). And we add our column PART NUMBER.
>
> While we already have indexes IDX1(NAME), IDX2(SURNAME), IDX3(AGE),
> IDX4(EMPL_DATE), we have to add a new 2-column index IDX5(PART, EMPL_DATE)
> to pre-load at startup, for example, recently employed persons.
>
> And if we'd like to query filtered data from the database, we'd also have
> to create the other compound indexes IDX6(PART, NAME), IDX7(PART, SURNAME),
> IDX8(PART, AGE). So we double the overhead imposed by the indexes.
>
> After these modifications to the database have been made and the PART
> column is filled, what should we do to preload the data?
>
> We would have to perform as many database queries as there are partitions
> stored on the nodes. With the default settings of the affinity function,
> that is 1024 queries. Some calls may not return any data at all, which
> makes them wasted network round-trips. It may also be a problem for some
> databases to run that many parallel queries efficiently without degrading
> total throughput.
>
> The DataStreamer approach may be faster, but that should be tested.
>
> 2016-11-16 16:40 GMT+03:00 Dmitriy Setrakyan <[email protected]>:
>
> > On Wed, Nov 16, 2016 at 1:54 PM, Yakov Zhdanov <[email protected]>
> > wrote:
> >
> > > > On Wed, Nov 16, 2016 at 11:22 AM, Yakov Zhdanov <[email protected]>
> > > > wrote:
> > > >
> > > > > > Yakov, I agree that such scenario should be avoided. I also think
> > > > > > that loadCache(...) method, as it is right now, provides a way to
> > > > > > avoid it.
> > > > >
> > > > > No, it does not.
> > > > >
> > > > Yes it does :)
> > >
> > > No it doesn't. Load cache should either send a query to DB that filters
> > > all the data on server side which, in turn, may result to full-scan of
> > > 2 Tb data set dozens of times (equal to node count) or send a query that
> > > brings the whole dataset to each node which is unacceptable as well.
> > >
> > >
> >
> > Why not store the partition ID in the database and query only local
> > partitions? Whatever approach we design with a DataStreamer will be
> > slower than this.
> >
>
>
>
> --
> Thanks,
> Alexandr Kuramshin
>
