Pro tip.  I've seen both push-based systems and pull-based systems at
work.  The push-based systems tend to break whenever the thing that
you're pushing to has problems.  Pull-based systems tend to be much
more reliable in my experience.

You have described a push-based system.  I would therefore avoid that
design.  Yes, I know you're afraid of lock contention on a hot spot.
But a simple database table that does nothing but hand out known jobs
can sustain far more concurrency than most people realize.  A simple
Redis instance can improve on that by a surprisingly large factor.
There are ways to scale further, but you won't hit that bottleneck
until your workload is orders of magnitude larger than anything you
have today.
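
As a rough illustration of the pull model (the table and column names
below are made up for illustration, not from your schema), each worker
can claim its own batch of work straight from Postgres:

    -- each worker claims a batch of pending work for itself
    UPDATE pending_downloads
       SET state = 'claimed', claimed_by = 'worker-42'
     WHERE state = 'pending'
       AND id IN (SELECT id
                    FROM pending_downloads
                   WHERE state = 'pending'
                   ORDER BY id
                   LIMIT 100)
    RETURNING id, url;

Under concurrency a batch may come back smaller than 100 rows, which is
fine; newer Postgres versions also offer FOR UPDATE SKIP LOCKED to keep
workers from waiting on each other.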

If you disregard this tip, then at least learn from experience and give
thought in advance to how you're going to monitor the things that
you're pushing to, notice their problems, and fix them when they break
(rather than two weeks later, when someone wonders why their data
stopped updating).

On Thu, Apr 4, 2013 at 8:43 PM, John Redford <[email protected]> wrote:
>
> David Larochelle wrote:
>>
> [...]
>> We're using PostgreSQL 8.4 and running on Ubuntu. Almost all data is
>> stored in the database. The system contains a list of media sources
>> with associated RSS feeds. We have a downloads table that has all of
>> the URLs that we want to download or have downloaded in the past.
>> This table currently has ~200 million rows. We add downloads for the
>> RSS feed of each source to the downloads table a few times a day.
>> When these RSS feeds are downloaded, they are analyzed for new
>> stories. We add download entries for each new story that we
>> encounter.  We have an indexed enum field in the database that lets
>> us know if a download in the downloads table has already been
>> downloaded, needs to be downloaded, is queued in memory, or is
>> currently being downloaded. We have ~1.5 million URLs that need to be
>> downloaded at any one time.
>
> This does sound like it will not scale, which is only to restate what you
> have already said.
>
> Keeping all that data in one table means you will have to lock it for all
> your operations, preventing parallelism.  Keeping all that data in one table
> means it will always get larger and slower.  Keeping all that data in one
> table means conflicting optimization goals, and thus little optimization.
>
> I would suggest breaking your data up into a number of tables, such that the
> purpose of each would be more focused and amenable to optimizing for reading
> or writing -- for example, you could have one table of "needs", one of
> "queued", one of "downloading" and one of "downloaded", moving the data
> along from table to table in batches.  Thus, at any given moment, your
> "needs" table would only contain 1.5 million rows, rather than 200 million
> -- it will scale with your "current workload" rather than having to scale
> with "all the work you've ever done".
>
> One could suggest having separate queue tables set up for each of your
> downloading systems.  Thus your main "find work to do" query, which has the
> logic in it to avoid having too many URLs for a single target, would query
> the 1.5 million row "needs" table and move rows into the queue table
> associated with a downloader -- the downloader would simply need to perform
> a trivial "select" against those few-hundred/few-thousand rows,
> nigh-instantly getting its fresh load of work.  As data is downloaded, each
> downloader could move rows from its own queue table to the ~200 million row
> "downloaded" table.
>
> Since every downloader would have its own table, they would not conflict on
> locks there.  The write locks to insert into the downloaded table would not
> conflict with any read locks, as you wouldn't read from that table to find
> work to do.  Indeed, you should avoid having any read-friendly indexes on
> that table -- by having it be non-indexed, inserting data into it will be as
> fast as possible.
>
> All of these steps could be made still more efficient by using batched
> queries and appropriate units of work -- for example, if a queue table held
> a hundred URLs, the downloading system could refrain from moving them into
> the downloaded table until they were all complete -- thus it would only need
> one "insert from"/"delete" to move those hundred records in one transaction
> -- and it would not have to actually send any URLs over the network back to
> the database. A further efficiency could be gained by allotting two queue
> tables to each downloader, such that at any given moment, the downloader was
> working to process one of them, while the queue-master was working to find
> work to fill the other.
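>
> For example (again with invented names), one way to retire a whole queue
> table in a single transaction, so no URLs travel back over the network:
>
>     BEGIN;
>     -- assumes a fetched_at column on "downloaded"; adjust to taste
>     INSERT INTO downloaded (download_id, url, fetched_at)
>         SELECT download_id, url, now()
>           FROM queue_downloader_1;
>     TRUNCATE queue_downloader_1;  -- nothing else touches this table right now
>     COMMIT;
>
> While that runs against queue_downloader_1, the queue-master can be busy
> filling this downloader's second table.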
>
> If you already have a significant investment in logic within this database,
> you could leverage that by using procedures & cursors to fill queue tables
> and switch queue tables while processing the flow of results from your "find
> work to do" logic -- cursors will generally get you access to the results as
> soon as possible, before the query is complete.
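>
> A very rough PL/pgSQL shape for that (the function, column, and table
> names are all invented; the FOR loop processes rows as the query
> produces them rather than waiting for the full result):
>
>     CREATE OR REPLACE FUNCTION fill_queues() RETURNS void AS $$
>     DECLARE
>         w RECORD;
>     BEGIN
>         FOR w IN SELECT download_id, url, downloader_id
>                    FROM needs
>                   ORDER BY priority
>         LOOP
>             -- route each row to its downloader's queue table
>             IF w.downloader_id = 1 THEN
>                 INSERT INTO queue_downloader_1 VALUES (w.download_id, w.url);
>             ELSE
>                 INSERT INTO queue_downloader_2 VALUES (w.download_id, w.url);
>             END IF;
>         END LOOP;
>     END;
>     $$ LANGUAGE plpgsql;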
>
> By using separate tables, you could also migrate to such a system without
> impacting the current one, at the slight cost of some brief redundancy or
> the use of views -- depending on whether you would rather spend time or
> space -- but essentially your database would work both ways simultaneously
> during the crossover.
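>
> For instance, a compatibility view (names invented) could keep presenting
> the old single-table shape while the new tables do the real work:
>
>     CREATE VIEW downloads_compat AS
>         SELECT download_id, url, 'pending'::text    AS state FROM needs
>         UNION ALL
>         SELECT download_id, url, 'queued'::text     AS state FROM queue_downloader_1
>         UNION ALL
>         SELECT download_id, url, 'downloaded'::text AS state FROM downloaded;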
>
> You could also split your "downloaded" table up into multiple
> per-month/per-downloader/per-something tables -- if you have per-downloader
> tables here, that would avoid write-lock conflicts entirely -- and at any
> point where you want to select data out, a "union" across all tables, or
> periodic movement of the data into a single master table would suffice.
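>
> Postgres table inheritance is one built-in way to get that union across
> child tables; a sketch (the names and the date column are invented):
>
>     -- one child table per month under the main "downloaded" table
>     CREATE TABLE downloaded_2013_04 (
>         CHECK (fetched_at >= '2013-04-01' AND fetched_at < '2013-05-01')
>     ) INHERITS (downloaded);
>
>     -- selecting from the parent automatically reads all the children
>     SELECT count(*) FROM downloaded;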
>
> Fundamentally, you want to put the data into tables which have indexes (or
> not) and access patterns which will be efficient for the role of the data at
> that point in time, and not simply leave it all in one pile of conflicting
> indexes & locks.
>
>> [...]
>
> Various other comments have mentioned ways to distribute the workload once
> you get it out of the database.  In short, that would be a mistake.  What
> that is basically suggesting is that you have a separate "table" of your
> active work, but that rather than keeping it in the database, where it may
> be extremely efficient, you slog it around on slow network connections and
> convert it into heavy Perl data structures. It is to suggest that you accept
> abysmal performance of the database as an unalterable facet of the natural
> world.
>
> While SQL servers have numerous flaws, they are, if used even remotely properly,
> brilliant at facilitating parallel access to data -- you should be able to
> make this work very efficiently with just such a database if you structure
> your data for efficiency.
>

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
