Re: EntityBatchIterator for large data set queries

Pawan Verma Mon, 20 Jul 2020 02:11:26 -0700

Hi Prakhar,

Glad to know that this implementation helps you. Thanks for sharing details
:)
-- 
Thanks & Regards
Pawan Verma
Technical Consultant
*HotWax Systems*
*Enterprise open source experts*
http://www.hotwaxsystems.com



On Mon, Jul 20, 2020 at 1:42 PM Prakhar Kumar <
prakhar.ku...@hotwaxsystems.com> wrote:

> Hello Pawan,
>
> We were getting a hard time dealing with large datasets in our client
> project. We were streaming data from MySQL using the FetchSize and
> EntityListIterator, which helped us up to some point, but ultimately
> struggled with the further increase in data. This is where the batch
> processing implementation came to rescue. We incorporated it into our
> project and were able to process the data with ease. This implementation
> seems to be quite scalable and faster in performance as compared to
> streaming. Batch processing was the need of the hour and there we have it
> in OFBiz. Thanks, Pawan for your valuable contribution.
>
> On Wed, Jul 1, 2020 at 4:06 PM Pawan Verma <pawan.ve...@hotwaxsystems.com>
> wrote:
>
> > Hi Chandan, Jacques
> >
> > Thanks, for your feedback.
> >
> > Yes, To solve the problem of heavy entity operations which consumes all
> the
> > system memory, we have implemented EntityBatchIterator. Originally
> designed
> > for the heavy entity operations.
> > --
> > Thanks & Regards
> > Pawan Verma
> > Technical Consultant
> > *HotWax Systems*
> > *Enterprise open source experts*
> > http://www.hotwaxsystems.com
> >
> >
> > On Sun, Jun 28, 2020 at 10:16 PM Jacques Le Roux <
> > jacques.le.r...@les7arts.com> wrote:
> >
> > > Hi,
> > >
> > > I have not looked into any details but Chandan's advice sounds like a
> > wise
> > > one to me
> > >
> > > Jacques
> > >
> > > Le 27/06/2020 à 13:43, Chandan Khandelwal a écrit :
> > > > Hello Pawan,
> > > >
> > > > Approach looks good, my only suggestion is to use batch processing
> only
> > > > when we are dealing with large data set, as this method takes a
> longer
> > > time
> > > > compared to the normal method specially on a distributed environment,
> > > which
> > > > may negatively impact the performance.
> > > >
> > > > Kind Regards,
> > > > Chandan Khandelwal
> > > > Senior Manager, Enterprise Software Development
> > > >
> > > > *HotWax Systems*
> > > > *Enterprise open source experts*
> > > > cell: +91-98934-81076
> > > > office: 0731-409-3684
> > > > http://www.hotwaxsystems.com
> > > >
> > > >
> > > > On Fri, Jun 5, 2020 at 4:07 PM Pawan Verma <
> > > pawan.ve...@hotwaxsystems.com>
> > > > wrote:
> > > >
> > > >> Thanks, Pritam and Scott for the discussion.
> > > >>
> > > >> I've created Jira OFBIZ-11789 for this improvement and also created
> a
> > PR
> > > >> with the proposed changes.
> > > >>
> > > >> I request everyone to review the PR and suggest your thought on
> this.
> > > >> Thanks!
> > > >> --
> > > >> Thanks & Regards
> > > >> Pawan Verma
> > > >> Technical Consultant
> > > >> *HotWax Systems*
> > > >> *Enterprise open source experts*
> > > >> http://www.hotwaxsystems.com
> > > >>
> > > >>
> > > >> On Mon, Jun 1, 2020 at 12:36 PM Pritam Kute <
> > > pritam.k...@hotwaxsystems.com
> > > >> wrote:
> > > >>
> > > >>> Thanks Scott for your detailed explanation.
> > > >>>
> > > >>> The solution looks good to me too. My confusion was with why we are
> > > going
> > > >>> to implement new method if we can achieve that using the current
> > > >>> EntityQuery methods.
> > > >>>
> > > >>> +1 for adding queryBatchIterator() to EntityQuery.
> > > >>>
> > > >>> Kind Regards,
> > > >>> --
> > > >>> Pritam Kute
> > > >>>
> > > >>>
> > > >>> On Thu, May 28, 2020 at 6:32 AM Scott Gray <
> > > scott.g...@hotwaxsystems.com
> > > >>>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Pritam,
> > > >>>>
> > > >>>> I'm not sure about PostgreSQL or Derby but I know with MySQL that
> > > >> using a
> > > >>>> cursor doesn't really work.  You have to set the result set to
> > > >>>> TYPE_FORWARD_ONLY and CONCUR_READ_ONLY and also set the fetch size
> > to
> > > >>>> INTEGER.MIN_VALUE.  Only then will the driver stream the results
> and
> > > >> even
> > > >>>> then, you may not execute any other SQL commands on the connection
> > > >> until
> > > >>>> you have fully read or closed the resultset.
> > > >>>>
> > > >>>> So if an EntityListIterator doesn't really conserve memory, then
> you
> > > >> need
> > > >>>> to take a paging query approach such as this:
> > > >>>> EntityQuery query =
> > > >>> EntityQuery.use(delegator).from("SomeTable").limit(100)
> > > >>>> List results = null
> > > >>>> while (!(results = query.queryList()).isEmpty()) {
> > > >>>>   for (value : results) {
> > > >>>>    // do something with each value
> > > >>>>   }
> > > >>>>   query.offset(query.getOffset() + query.getLimit())
> > > >>>> }
> > > >>>>
> > > >>>> Or with the proposed EntityBatchIterator:
> > > >>>> Iterator query =
> > > >>>>
> > > >>>>
> > > >>
> > >
> >
> EntityQuery.use(delegator).from("SomeTable").limit(100).queryBatchIterator()
> > > >>>> while (iterator.hasNext()) {
> > > >>>>   result = iterator.next()
> > > >>>>   // do something with each value
> > > >>>> }
> > > >>>>
> > > >>>> I guess an alternative approach would be to implement something
> > > similar
> > > >>>> within the EntityListIterator and perhaps a flag to turn it off or
> > on
> > > >>>> depending on which database is being used and how well it supports
> > > >>>> iterating over results without loading the entire resultset into
> > > >> memory.
> > > >>>> Regards
> > > >>>> Scott
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Sat, 23 May 2020 at 20:59, Pritam Kute <
> > > >> pritam.k...@hotwaxsystems.com
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hello Pawan,
> > > >>>>>
> > > >>>>> I just had a look into the EntityQuery.queryIterator() method and
> > > >> looks
> > > >>>>> like we can achieve that by using fetchSize(), fowardOnly(),
> > > >>>>> cursorScrollInsensitive(), cursorScrollSensitive() and offset()
> > > >> methods
> > > >>>> in
> > > >>>>> EntityQuery class. Let me know if I am missing anything.
> > > >>>>>
> > > >>>>> It will be good if you can post a pseudo code or something here
> so
> > > >> that
> > > >>>> we
> > > >>>>> could get an understanding of the exact design which you have in
> > your
> > > >>>> mind.
> > > >>>>> Kind Regards,
> > > >>>>> --
> > > >>>>> Pritam Kute
> > > >>>>>
> > > >>>>>
> > > >>>>> On Thu, May 21, 2020 at 7:41 PM Pawan Verma <
> > > >>>> pawan.ve...@hotwaxsystems.com
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Hello Devs,
> > > >>>>>>
> > > >>>>>> While working on the large database we have figured out that
> very
> > > >>> large
> > > >>>>>> queries consume all memory and crash ofbiz(because
> queryIterator()
> > > >>>>> doesn't
> > > >>>>>> really work, it's no different from queryList())
> > > >>>>>>
> > > >>>>>> The EntityListIterator attempts to use a cursor to iterate over
> > > >> large
> > > >>>>>> result sets but in reality most databases do not give us a
> cursor
> > > >>>> unless
> > > >>>>> we
> > > >>>>>> ask for it in a very specific way, and instead you get back the
> > > >> full
> > > >>>>> result
> > > >>>>>> set and potentially consume a large amount of memory.  For
> > example,
> > > >>> the
> > > >>>>>> MySql details are here (ResultSet section):
> > > >>>>>>
> > > >>>>>>
> > > >>
> > >
> >
> https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
> > > >>>>>> To work around this we've built the EntityBatchIterator which
> > > >> instead
> > > >>>>>> basically just pages through results like you would do on a
> > webpage
> > > >>>> that
> > > >>>>>> uses pagination.  The iterator takes an EntityQuery and when
> > next()
> > > >>> is
> > > >>>>>> called it grabs the first 500 records and returns the first
> > record,
> > > >>>> call
> > > >>>>>> next again and you get the second record, once you've called
> next
> > > >> 500
> > > >>>>> times
> > > >>>>>> it runs the query again with an offset and grabs the next 500
> and
> > > >> so
> > > >>> on
> > > >>>>>> until there are no more records.
> > > >>>>>>
> > > >>>>>> The main downsides to this approach are:
> > > >>>>>> 1. Same as when using limit/offset you want to be sure your
> > results
> > > >>>> will
> > > >>>>>> come back in a consistent order to avoid accidentally skipping
> > some
> > > >>>> rows
> > > >>>>>> and seeing other rows twice.
> > > >>>>>> 2. Because the results are a combination of many queries rather
> > > >> than
> > > >>> a
> > > >>>>>> single query, some of the data may change while you are paging
> > > >>> through
> > > >>>>> it.
> > > >>>>>> i.e. if you were to sort by lastUpdatedTxStamp you may see some
> > > >> rows
> > > >>>>> twice
> > > >>>>>> as they are updated by other transactions (this might be a good
> > > >> thing
> > > >>>> in
> > > >>>>>> some cases).
> > > >>>>>>
> > > >>>>>> So, the main proposal is to introduce a new EntityBatchIterator
> > for
> > > >>>> large
> > > >>>>>> queries.  If we agree on the proposal I'll create a Jira and
> > > >> provide
> > > >>> a
> > > >>>>>> patch for the community review. Thanks!
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Thanks & Regards
> > > >>>>>> Pawan Verma
> > > >>>>>> Technical Consultant
> > > >>>>>> *HotWax Systems*
> > > >>>>>> *Enterprise open source experts*
> > > >>>>>> http://www.hotwaxsystems.com
> > > >>>>>>
> > >
> >
>
>
> --
> Kind Regards,
> Prakhar Kumar
> Sr. Enterprise Software Engineer
>
> *HotWax Systems*
> *Enterprise open source experts*
> cell: +91-89628-81820
> office: 0731-409-3684
> http://www.hotwaxsystems.com
>

Re: EntityBatchIterator for large data set queries

Reply via email to