Eli Reisman created GIRAPH-308:
----------------------------------
Summary: Giraph consistently creates 10% more InputSplits than one would expect
Key: GIRAPH-308
URL: https://issues.apache.org/jira/browse/GIRAPH-308
Project: Giraph
Issue Type: Bug
Components: graph
Affects Versions: 0.2.0
Reporter: Eli Reisman
Priority: Minor
Fix For: 0.2.0
As I have been doing a lot of instrumented runs for scale-out, and to test 246
and 301 (among other patches), I have seen that the calculation:
(# of MB in input files) / (giraph.splitmb setting) == # of InputSplits to expect
does not yield the number of splits I actually get. I would expect an extra
split now and then to round off fractional amounts in a calculation such as the
one stated above, but I'm consistently seeing more than that: roughly 10% more
than expected, and this holds across runs with many different data load sizes.
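
To make the expectation concrete, here is a minimal sketch of the naive
calculation described above and of how the overshoot could be measured. It is
not what Giraph does internally; the function names and the sample figures
(10,000 MB of input, a 256 MB split size, 44 observed splits) are hypothetical
and chosen only for illustration.

```python
import math


def expected_splits(input_mb: float, split_mb: float) -> int:
    """Naive expectation: total input size divided by the giraph.splitmb
    setting, rounded up to cover any fractional remainder."""
    if split_mb <= 0:
        raise ValueError("split_mb must be positive")
    return math.ceil(input_mb / split_mb)


def overshoot(observed: int, expected: int) -> float:
    """Fraction by which the observed split count exceeds the expected one."""
    return (observed - expected) / expected


# Hypothetical figures, not taken from a real run.
expected = expected_splits(10_000, 256)   # 10000 / 256 = 39.06..., rounded up
observed = 44                             # assumed count reported by a job
print(expected, observed, overshoot(observed, expected))
```

Rounding alone would account for at most one extra split, so a steady gap of
about 10% points to something else in how the splits are generated.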
If there is some simple explanation, perhaps I'll find it in the code, but
either way I wanted to post a JIRA because this is somewhat counterintuitive
and suggests we should alter the behavior of giraph.splitmb to ensure users get
the number of input splits they expect. In memory-scarce use cases, I am
finding that if a given worker reads just one split too many on a given data
load, it will overload and fail. Knowing with some precision how many workers
to allocate for a given data load has been the key to scaling out under scarce
resources here. Seeing these numbers now as I test 301 (which is meant to help
ensure the split-reading load is spread out evenly among workers), I realize
this has fooled me at times in the past when setting the -w and
-Dgiraph.splitmb options carefully.
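
As a rough illustration of why the extra splits matter when sizing -w, the
sketch below pads the naive split count by the roughly 10% overshoot observed
above before dividing by a per-worker limit. It assumes splits are spread
evenly across workers, and the function name, the per-worker limit of 4 splits,
and the input figures are all hypothetical.

```python
import math


def min_workers(input_mb: float, split_mb: float,
                max_splits_per_worker: int,
                overshoot_percent: int = 10) -> int:
    """Smallest worker count such that no worker reads more than
    max_splits_per_worker splits, assuming an even spread of splits."""
    naive = math.ceil(input_mb / split_mb)
    # Integer ceiling of naive * (100 + overshoot_percent) / 100,
    # which avoids floating-point rounding surprises.
    padded = (naive * (100 + overshoot_percent) + 99) // 100
    return math.ceil(padded / max_splits_per_worker)


# Hypothetical figures: 10,000 MB of input, 256 MB splits, and workers
# that can each hold at most 4 splits in memory.
print(min_workers(10_000, 256, 4, overshoot_percent=0))  # naive estimate
print(min_workers(10_000, 256, 4))                       # padded by 10%
```

With these made-up numbers the naive estimate comes out one worker short of the
padded one, which is the kind of gap that leaves some worker reading one split
too many.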
At the very least, it would be nice to hear from someone who knows what's going
on here, so there is a definitive posting on this matter that folks can refer
to in the future when exploring a use case like mine. Many users here will be
in the same boat as me, of course :)
Thanks in advance.