Re: Issues about Partitioning and Record converter

Edward J. Yoon Mon, 06 May 2013 17:38:56 -0700

Do you also need a separate wiki? :-) If not, please feel free to
describe your ideas on the wiki, divided into short-term and long-term plans.


On Tue, May 7, 2013 at 4:35 AM, Edward J. Yoon <[email protected]> wrote:
> 1. Graph/matrix data is small, but graph/matrix algorithms require huge
> computations. Hence, the number of BSP processors should be adjustable
> (not tied to the number of file blocks).
>
> 2. I'm -1 on using local disk to store partitions. Writing to HDFS is
> costly, but reuse of partitions should be considered.
>
> On Tue, May 7, 2013 at 2:08 AM, Tommaso Teofili
> <[email protected]> wrote:
>> 2013/5/6 Suraj Menon <[email protected]>
>>
>>> I am assuming that the storage of vertices (NoSQL or any other format) need
>>> not be updated after every iteration.
>>>
>>> Based on the above assumption, I have the following suggestions:
>>>
>>> - Instead of running a separate job, we inject a partitioning superstep
>>> before the first superstep of the job. (This has a dependency on the
>>> Superstep API)
>>>
>>
>> could we do that without introducing that dependency? I mean would that
>> work also if not using the Superstep API on the client side?
>>
>>
>>> - Instead of being written to HDFS, which creates a copy of the input
>>> files in the HDFS cluster (too costly, I believe), the partitions should
>>> be written to and read from local files.
>>>
>>
>> +1
>>
>>
>>> - For graph jobs, we can configure this partitioning superstep to use a
>>> graph-specific partitioning class that partitions and loads vertices.
>>>
>>
>> this seems to be in line with the above assumption, so it probably makes
>> sense.
>>
>>
>>>
>>> This sure has some dependencies, but it would be a graceful solution and
>>> can tackle every problem. This is what I want to achieve in the end. Please
>>> proceed if you have any intermediate ways to get there faster.
>>>
>>
>> Your solution sounds good to me generally, better if we can avoid the
>> dependency, but still ok if not.
>> Let's collect also others' opinions and try to reach a shared consensus.
>>
>> Tommaso
>>
>>
>>
>>
>>>
>>> Regards,
>>> Suraj
>>>
>>>
>>>
>>>
>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]
>>> >wrote:
>>>
>>> > P.S. The same applies to BSPJob (with table input). It's not only for
>>> > GraphJob.
>>> >
>>> > On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>>> > wrote:
>>> > > All,
>>> > >
>>> > > I've also roughly described the design of the Graph APIs[1]. To
>>> > > reduce misunderstandings (please read the Partitioning and
>>> > > GraphModuleInternals documents first):
>>> > >
>>> > >  * In the NoSQL case, there's obviously no need to hash-partition or
>>> > > rewrite partition files on HDFS. So, for these inputs, I think the
>>> > > vertex structure should be parsed in the
>>> > > GraphJobRunner.loadVertices() method.
>>> > >
>>> > > Here, we face two options: 1) The current implementation of
>>> > > 'PartitioningRunner' writes converted vertices to sequence-format
>>> > > partition files, and GraphJobRunner reads only Vertex Writable
>>> > > objects. If the input is a table, we may have to skip the
>>> > > partitioning job and parse the vertex structure in the
>>> > > loadVertices() method after checking some conditions.
>>> > > 2) PartitioningRunner just writes raw records to the proper
>>> > > partition files after checking each record's partition ID, and
>>> > > GraphJobRunner.loadVertices() always parses and loads vertices.
>>> > >
>>> > > I meant that I prefer the latter, and there's no need to write
>>> > > VertexWritable files. This is unrelated to whether graph will support
>>> > > only the Seq format. I hope my explanation is enough!
>>> > >
>>> > > 1. http://wiki.apache.org/hama/GraphModuleInternals
>>> > >
>>> > > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]
>>> >
>>> > wrote:
>>> > >> I've described my big picture here:
>>> > >> http://wiki.apache.org/hama/Partitioning
>>> > >>
>>> > >> Please review and give feedback on whether this is acceptable.
>>> > >>
>>> > >>
>>> > >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>> > >>> P.S. I think there's a misunderstanding. It doesn't mean that graph
>>> > >>> will support only the sequence file format. The main question is
>>> > >>> whether to convert at the partitioning stage or at the loadVertices
>>> > >>> stage.
>>> > >>>
>>> > >>> Sent from my iPhone
>>> > >>>
>>> > >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]>
>>> wrote:
>>> > >>>
>>> > >>>> Sure, Please go ahead.
>>> > >>>>
>>> > >>>>
>>> > >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <
>>> [email protected]
>>> > >wrote:
>>> > >>>>
>>> > >>>>>>> Please let me know before this is changed, I would like to work
>>> on
>>> > a
>>> > >>>>>>> separate branch.
>>> > >>>>>
>>> > >>>>> Personally, I think we have to focus on high-priority tasks and on
>>> > >>>>> getting more feedback and contributions from users. So, if changes
>>> > >>>>> are made, I'll release periodically. If you want to work somewhere
>>> > >>>>> else, please do. I don't want to wait for your patches.
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <
>>> > [email protected]>
>>> > >>>>> wrote:
>>> > >>>>>> To prepare for integration with NoSQL stores, of course, a
>>> > >>>>>> condition check (whether converted or not) could perhaps be used
>>> > >>>>>> without removing the record converter.
>>> > >>>>>>
>>> > >>>>>> We need to discuss everything.
>>> > >>>>>>
>>> > >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <
>>> [email protected]
>>> > >
>>> > >>>>> wrote:
>>> > >>>>>>> I am still -1 if this means our graph module can work only on
>>> > >>>>>>> the sequence file format.
>>> > >>>>>>> Please note that you can set the record converter to null and
>>> > >>>>>>> make changes to loadVertices for what you desire here.
>>> > >>>>>>>
>>> > >>>>>>> If we came to this design, because TextInputFormat is
>>> inefficient,
>>> > would
>>> > >>>>>>> this work for Avro or Thrift input format?
>>> > >>>>>>> Please let me know before this is changed, I would like to work
>>> on
>>> > a
>>> > >>>>>>> separate branch.
>>> > >>>>>>> You may proceed as you wish.
>>> > >>>>>>>
>>> > >>>>>>> Regards,
>>> > >>>>>>> Suraj
>>> > >>>>>>>
>>> > >>>>>>>
>>> > >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <
>>> > [email protected]
>>> > >>>>>> wrote:
>>> > >>>>>>>
>>> > >>>>>>>> I think the 'record converter' should be removed. It's not a
>>> > >>>>>>>> good idea. Moreover, it's unnecessarily complex. To keep the
>>> > >>>>>>>> vertex input reader, we can move the related classes into the
>>> > >>>>>>>> common module.
>>> > >>>>>>>>
>>> > >>>>>>>> Let's go with my original plan.
>>> > >>>>>>>>
>>> > >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <
>>> > [email protected]>
>>> > >>>>>>>> wrote:
>>> > >>>>>>>>> Hi all,
>>> > >>>>>>>>>
>>> > >>>>>>>>> I'm reading our old discussions about record converter,
>>> superstep
>>> > >>>>>>>>> injection, and common module:
>>> > >>>>>>>>>
>>> > >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>> > >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>> > >>>>>>>>>
>>> > >>>>>>>>> To clarify goals and objectives:
>>> > >>>>>>>>>
>>> > >>>>>>>>> 1. A parallel input partition is necessary for obtaining
>>> > scalability
>>> > >>>>>>>>> and elasticity of a Bulk Synchronous Parallel processing (It's
>>> > not a
>>> > >>>>>>>>> memory issue, or Disk/Spilling Queue, or HAMA-644. Please don't
>>> > >>>>>>>>> shake).
>>> > >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>>> > >>>>>>>>> level, and it is for every Hama job, not only for Graph jobs.
>>> > >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>> > >>>>>>>>> input should also be considered.
>>> > >>>>>>>>>
>>> > >>>>>>>>> The current problem is that every input of a graph job has to
>>> > >>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me know.
>>> > >>>>>>>>>
>>> > >>>>>>>>> --
>>> > >>>>>>>>> Best Regards, Edward J. Yoon
>>> > >>>>>>>>> @eddieyoon
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>>
>>> > >>>>>>>> --
>>> > >>>>>>>> Best Regards, Edward J. Yoon
>>> > >>>>>>>> @eddieyoon
>>> > >>>>>>
>>> > >>>>>>
>>> > >>>>>>
>>> > >>>>>> --
>>> > >>>>>> Best Regards, Edward J. Yoon
>>> > >>>>>> @eddieyoon
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> --
>>> > >>>>> Best Regards, Edward J. Yoon
>>> > >>>>> @eddieyoon
>>> > >>>>>
>>> > >>
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Best Regards, Edward J. Yoon
>>> > >> @eddieyoon
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Best Regards, Edward J. Yoon
>>> > > @eddieyoon
>>> >
>>> >
>>> >
>>> > --
>>> > Best Regards, Edward J. Yoon
>>> > @eddieyoon
>>> >
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon
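
For readers following the thread, option 2 above (the partitioner only routes raw records by partition ID, and the load step does all the parsing) could look roughly like the sketch below. This is only an illustration under stated assumptions: it does not use Hama's real PartitioningRunner or GraphJobRunner classes, the class and method names are hypothetical, and the tab-separated "vertexId, neighbor, neighbor, ..." record layout is assumed for the example. The number of partitions is a free parameter here, which also reflects the point that the number of BSP tasks need not equal the number of file blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Hama code: raw records are routed to partitions
// untouched, and parsing into vertices happens only at load time.
public class RawRecordPartitioningSketch {

    // Hash partitioning: map a vertex ID to one of numTasks partitions.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partitionOf(String vertexId, int numTasks) {
        return (vertexId.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    // Partitioning stage: peek at the ID (assumed to be the first
    // tab-separated field) and append the raw record to its partition.
    static List<List<String>> partition(List<String> rawRecords, int numTasks) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : rawRecords) {
            String vertexId = record.split("\t", 2)[0];
            partitions.get(partitionOf(vertexId, numTasks)).add(record);
        }
        return partitions;
    }

    // Load stage: always parse raw records into vertexId -> neighbor list.
    static Map<String, List<String>> loadVertices(List<String> partition) {
        Map<String, List<String>> vertices = new HashMap<>();
        for (String record : partition) {
            String[] fields = record.split("\t");
            List<String> edges = new ArrayList<>(
                Arrays.asList(fields).subList(1, fields.length));
            vertices.put(fields[0], edges);
        }
        return vertices;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a\tb\tc", "b\ta", "c\ta\tb");
        List<List<String>> partitions = partition(input, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": "
                + loadVertices(partitions.get(i)));
        }
    }
}
```

Because the partition files hold raw records, the same load path would apply whether the records came from a partitioning step or directly from a table input, which is the property argued for in option 2.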
