Thank you very much Gilberto!

It's great to make contact with people out there who are in the same boat.
I've just been watching a series of videos on pipelines, and I'm starting 
to get the pattern for big data processing that Google promotes:

Datastore -> Cloud Storage -> BigQuery

The key point is that BigQuery is "append only", something I didn't realize 
before.
Here are the videos:

   1. Google I/O 2012 - Building Data Pipelines at Google Scale: 
   http://youtu.be/lqQ6VFd3Tnw 
   2. BigQuery: Simple example of a data collection and analysis pipeline + 
   Yo...: http://youtu.be/btJE659h5Bg
   3. GCP Cloud Platform Integration Demo: http://youtu.be/JcOEJXopmgo
   
It seems all I need is the Pipeline API, iterating over the Datastore (in 
order, I guess, with a query) and producing CSV (and other formats) as 
output.
That should allow me to do what I already do, but on top of multiple 
(perhaps sequential) task queues rather than just one.
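To make that concrete, here's a minimal sketch of the cursor-chained export I have in mind. It stubs out the Datastore with a plain list and `fetch_page` is a made-up stand-in (a real version would use an ordered Datastore query with a cursor, resuming in a fresh task each slice) — it's not the actual GAE Pipeline or Datastore API:

```python
import csv
import io

# Stub standing in for an ordered Datastore query result set.
ROWS = [{"id": i, "value": i * i} for i in range(2500)]

PAGE_SIZE = 1000  # keep at most one page in memory, as in my current loop


def fetch_page(cursor, page_size=PAGE_SIZE):
    """Return (rows, next_cursor); next_cursor is None when exhausted."""
    rows = ROWS[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(ROWS) else None
    return rows, next_cursor


def run_export():
    """Each loop iteration stands in for one short chained task resuming
    from the cursor left by the previous one."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["id", "value"])
    writer.writeheader()
    cursor = 0
    while cursor is not None:
        rows, cursor = fetch_page(cursor)  # one task's slice of work
        writer.writerows(rows)             # append the slice to the output
    return out.getvalue()


csv_text = run_export()
print(csv_text.splitlines()[0])  # the CSV header row
```

The point is that no task ever holds more than one page in memory, and the cursor is the only state that has to survive between tasks.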

From the point of view of costs, I currently rely heavily on, possibly 
abusing, memcache. Without memcache, I expect costs to go up.
A further improvement would be to update only subsets of data, rather than 
the whole lot. I've been designing a new datastore 'schema' so that my data 
is hierarchically organized in entity groups; that way I could generate a 
file per entity group (once it has changed) and have a final stage that 
assembles those files together.
I'm pretty happy with my current task because, as I wrote, it is simple and 
elegant.
If I could upgrade the same algorithm to a Datastore input reader for 
pipelines, that should do for us.
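The per-entity-group idea above would look roughly like this (in-memory dicts stand in for per-group Cloud Storage files; `export_group` and `assemble` are hypothetical names, not a real API):

```python
import csv
import io

# Entities keyed by their entity group (the root ancestor in the new schema).
ENTITIES = [
    {"group": "site-a", "id": 1, "value": 10},
    {"group": "site-a", "id": 2, "value": 20},
    {"group": "site-b", "id": 3, "value": 30},
]

# Stand-in for the per-group CSV files kept in Cloud Storage.
group_files = {}


def export_group(group):
    """Regenerate the CSV for one entity group (run only when it changes)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["group", "id", "value"])
    writer.writeheader()
    for entity in ENTITIES:
        if entity["group"] == group:
            writer.writerow(entity)
    group_files[group] = out.getvalue()


def assemble():
    """Final stage: concatenate per-group files, keeping a single header."""
    parts = []
    for i, (group, text) in enumerate(sorted(group_files.items())):
        lines = text.splitlines()
        parts.extend(lines if i == 0 else lines[1:])  # drop repeated headers
    return "\n".join(parts)


for g in ("site-a", "site-b"):
    export_group(g)
print(assemble())
```

Only the groups that changed get re-exported; the assembly stage is cheap because it just stitches existing files together.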

Emanuele

On Friday, 12 December 2014 02:29:54 UTC+13, Gilberto Torrezan Filho wrote:
>
> I've used MapReduce myself for a while, and I can tell you: 100+MB of 
> keys means A LOT of keys at the shuffle stage. And the real limitations of 
> MapReduce are:
>
> "The total size of all the instances of Mapper, InputReader, OutputWriter 
> and Counters must be less than 1MB between slices. This is because these 
> instances are serialized and saved to the datastore between slices."
>
> Source 
> <https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/2.6-The-Java-MapReduce-Library>
>
> The real problem with MapReduce, in my opinion, is the latency of the 
> operations and the huge number of datastore reads/writes needed to keep 
> things working between slices (which considerably increases costs). You 
> can't rely on MapReduce for real-time or near-real-time work as you can 
> with pure task queues. And it only really shines when you can afford a 
> large number of machines to run your logic - running MapReduce on a few 
> machines is sometimes worse than pure sequential brute force.
>
> Fitting your problem into a MapReduce process is actually good for your code 
> - even if you don't use the library itself. It forces you to think about how 
> you can split your huge tasks into smaller, more manageable and more 
> scalable pieces. It's a good exercise - sometimes you think you can't 
> parallelize your problem, but when you're forced into the MapReduce workflow, 
> you might find you were actually wrong, and by the end of the day you have 
> better code.
>
> On Wednesday, December 10, 2014 6:22:17 PM UTC-2, Emanuele Ziglioli wrote:
>>
>> It all comes at a cost: increased complexity.
>> You can't beat the simplicity of task queues, and the 10m limit seems 
>> artificially imposed to me. I mean, we pay for CPU time, just as we would 
>> for 20m, 30m or 1h tasks.
>> I've got a simple task that takes a long time, looping through hundreds 
>> of thousands of rows to produce ordered output files.
>> The current code is simple and elegant, but I have to keep increasing the 
>> CPU size in order to finish the task within 10m.
>> A solution could be using MapReduce, but I haven't figured out yet how 
>> MapReduce would solve my problem without hitting the memory limit: with my 
>> simple task there are only 1000 rows in memory at any given time (of 
>> course, minus the GC). A MapReduce shuffle stage would require all 
>> entities, or at least their keys, to be kept in memory, and that's 
>> impossible with F1s or F2s.
>>
>> Emanuele
>>
>> On Wednesday, 10 December 2014 19:24:30 UTC+13, Vinny P wrote:
>>>
>>> On Sat, Dec 6, 2014 at 5:58 AM, Maneesh Tripathi <
>>> [email protected]> wrote:
>>>
>>>> I have created a task queue which stops working after 10 minutes.
>>>> I want to increase the timing. 
>>>> Please help me with this. 
>>>>
>>>
>>>
>>> Task queue requests are limited to 10 minutes of execution time: 
>>> https://cloud.google.com/appengine/docs/java/taskqueue/overview-push#task_deadlines
>>>
>>> If you need to go past the 10 minute deadline, you're better off using a 
>>> manual or basic scaled module: 
>>> https://cloud.google.com/appengine/docs/java/modules/#scaling_types
>>>
>>>  
>>> -----------------
>>> -Vinny P
>>> Technology & Media Consultant
>>> Chicago, IL
>>>
>>> App Engine Code Samples: http://www.learntogoogleit.com
>>>
>>>
