I think there was a misunderstanding around the terms 'remove' and 'record converter'.
PartitioningRunner converts records; that is what I call the 'record converter'. But there's no need to write the converted records in PartitioningRunner. The Partitioner is just a partitioner in the BSP core module.

On Tue, May 7, 2013 at 3:46 AM, Edward J. Yoon <[email protected]> wrote:
> Currently, the PartitioningRunner writes converted records to a
> partition file, and then GraphJobRunner reads VertexWritable, NullWritable
> K/V records. In other words:
>
> 1) input record: 'a\tb\tc' // assume that input is Text
> 2) partition files: a sequence of Vertex writables
> 3) GraphJobRunner.loadVertices() reads the sequence-format partition files.
>
> My suggestion is to just write raw records to the partition file in
> PartitioningRunner:
>
> 1) input record: 'a\tb\tc' // assume that input is Text
> 2) partition files: 'a\tb\tc' // data shuffled by partition ID, but the
> format is the same as the original.
> 3) GraphJobRunner.loadVertices() reads the records from its assigned
> partition and parses the Vertex structure.
>
> Only a few lines will change.
>
> Why? As I described in the Wiki, in the NoSQL table-input case (which
> supports range or random access by sorted key), there's no need to
> re-partition, because the data is already range-partitioned. That means
> parsing of the vertex structure is needed in GraphJobRunner.
>
> With or without Suraj's suggestion, parsing the vertex structure should
> be done in the GraphJobRunner.loadVertices() method to prepare for the
> NoSQL input formats.
>
> Does that make sense?
>
> On Tue, May 7, 2013 at 2:55 AM, Tommaso Teofili
> <[email protected]> wrote:
>> 2013/5/6 Edward J. Yoon <[email protected]>
>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API.)
>>>> - The partitions, instead of being written to HDFS, which creates a
>>>> copy of the input files in the HDFS cluster (too costly, I believe),
>>>> should be written to local files and read from there.
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads
>>>> vertices.
>>>
>>> I believe the above suggestion can be a future improvement task.
>>>
>>>> This sure has some dependencies, but it would be a graceful solution
>>>> and can tackle every problem. This is what I want to achieve in the
>>>> end. Please proceed if you have any intermediate ways to get there
>>>> faster.
>>>
>>> If you understand my plan now, please let me know so that I can start
>>> the work. My patch will change only a few lines.
>>
>> While it's clear to me what Suraj's proposal is, I'm not completely sure
>> what your final proposal would be. Could you explain it in more detail
>> (or perhaps a patch to review would be enough)?
>>
>>> Finally, I think we can now prepare the integration with the NoSQL
>>> table input formats.
>>
>> As I said, I'd like to have broad consensus before making any
>> significant change to core stuff.
>>
>> thanks,
>> Tommaso
>>
>> p.s.:
>> probably worth a different thread: what's the NoSQL usage scenario with
>> regard to Hama?
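For concreteness, a minimal sketch of the proposed step 3, assuming a tab-separated Text record whose first token is the vertex ID. The class and method names below are invented for illustration; this is not Hama's actual API:

import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch only. The partition file keeps the raw record
 * 'a\tb\tc' unchanged; loadVertices() does the parsing itself instead of
 * reading pre-converted VertexWritable objects from a sequence file.
 */
public class RawVertexParsing {

    /** Minimal stand-in for a parsed vertex: an ID plus outgoing edges. */
    static final class Vertex {
        final String id;
        final List<String> edges;

        Vertex(String id, List<String> edges) {
            this.id = id;
            this.edges = edges;
        }
    }

    /** Parses one raw tab-separated record: "a\tb\tc" -> id "a", edges [b, c]. */
    static Vertex parse(String rawRecord) {
        String[] tokens = rawRecord.split("\t");
        return new Vertex(tokens[0], Arrays.asList(tokens).subList(1, tokens.length));
    }

    public static void main(String[] args) {
        // Conversion happens at load time, not at partitioning time.
        Vertex v = parse("a\tb\tc");
        System.out.println(v.id + " -> " + v.edges); // prints: a -> [b, c]
    }
}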
>>> On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
>>>> I am assuming that the storage of vertices (NoSQL or any other format)
>>>> need not be updated after every iteration.
>>>>
>>>> Based on the above assumption, I have the following suggestions:
>>>>
>>>> - Instead of running a separate job, we inject a partitioning superstep
>>>> before the first superstep of the job. (This has a dependency on the
>>>> Superstep API.)
>>>> - The partitions, instead of being written to HDFS, which creates a
>>>> copy of the input files in the HDFS cluster (too costly, I believe),
>>>> should be written to local files and read from there.
>>>> - For graph jobs, we can configure this partitioning superstep class
>>>> specific to a graph partitioning class that partitions and loads
>>>> vertices.
>>>>
>>>> This sure has some dependencies, but it would be a graceful solution
>>>> and can tackle every problem. This is what I want to achieve in the
>>>> end. Please proceed if you have any intermediate ways to get there
>>>> faster.
>>>>
>>>> Regards,
>>>> Suraj
>>>>
>>>> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>>>>> P.S.: BSPJob (with table input) is the same; it's not only for GraphJob.
>>>>>
>>>>> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>> All,
>>>>>>
>>>>>> I've also roughly described the details of the Graph API design [1].
>>>>>> To reduce our misunderstandings (please read the Partitioning and
>>>>>> GraphModuleInternals documents first):
>>>>>>
>>>>>> * In the NoSQL case, there's obviously no need for hash partitioning
>>>>>> or for rewriting partition files on HDFS. So, for these inputs, I
>>>>>> think the vertex structure should be parsed in the
>>>>>> GraphJobRunner.loadVertices() method.
>>>>>>
>>>>>> Here we face two options: 1) The current implementation of
>>>>>> 'PartitioningRunner' writes converted vertices to sequence-format
>>>>>> partition files, and GraphJobRunner reads only Vertex Writable
>>>>>> objects. If the input is a table, we may have to skip the
>>>>>> partitioning job and parse the vertex structure in the loadVertices()
>>>>>> method after checking some conditions. 2) PartitioningRunner just
>>>>>> writes raw records to the proper partition files after checking their
>>>>>> partition IDs, and GraphJobRunner.loadVertices() always parses and
>>>>>> loads the vertices.
>>>>>>
>>>>>> I meant that I prefer the latter, and that there's no need to write
>>>>>> VertexWritable files. It's not related to whether graph will support
>>>>>> only the Seq format or not. I hope my explanation is enough!
>>>>>>
>>>>>> 1. http://wiki.apache.org/hama/GraphModuleInternals
>>>>>>
>>>>>> On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>> I've described my big picture here:
>>>>>>> http://wiki.apache.org/hama/Partitioning
>>>>>>>
>>>>>>> Please review it and give feedback on whether it is acceptable.
>>>>>>>
>>>>>>> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>>>>>>>> p.s., I think there's a misunderstanding. It doesn't mean that
>>>>>>>> graph will support only the sequence file format. The main question
>>>>>>>> is whether to convert at the partitioning stage or at the
>>>>>>>> loadVertices stage.
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>>>>>>>>> Sure, please go ahead.
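For concreteness, a sketch of option 2 above, the raw-record path: the partitioner only computes a partition ID and appends the unchanged record to that partition's file. The names and the hash scheme are assumptions for illustration; this is not the actual PartitioningRunner code:

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch only. Records are shuffled into partition files by
 * partition ID, but each record is written in its original, unconverted
 * format; no VertexWritable is produced at this stage.
 */
public class RawRecordPartitioning {

    /** Hash partitioning on the vertex ID (the first tab-separated token). */
    static int partitionIdFor(String vertexId, int numPartitions) {
        return (vertexId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Appends the raw record, unchanged, to the writer for its partition. */
    static void writeRecord(String rawRecord, List<Writer> partitionWriters)
            throws IOException {
        String vertexId = rawRecord.split("\t", 2)[0];
        int pid = partitionIdFor(vertexId, partitionWriters.size());
        // Parsing is deferred to GraphJobRunner.loadVertices().
        partitionWriters.get(pid).write(rawRecord + "\n");
    }

    public static void main(String[] args) throws IOException {
        List<Writer> writers = Arrays.<Writer>asList(new StringWriter(), new StringWriter());
        writeRecord("a\tb\tc", writers);
        System.out.println("partition 0: " + writers.get(0));
        System.out.println("partition 1: " + writers.get(1));
    }
}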
>>>>>>>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>> Please let me know before this is changed; I would like to work
>>>>>>>>>>> on a separate branch.
>>>>>>>>>>
>>>>>>>>>> Personally, I think we have to focus on the high-priority tasks,
>>>>>>>>>> and on getting more feedback and contributions from users. So, as
>>>>>>>>>> changes are made, I'll release periodically. If you want to work
>>>>>>>>>> somewhere else, please do. I don't want to wait for your patches.
>>>>>>>>>>
>>>>>>>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>> To prepare the integration with NoSQLs, of course, a condition
>>>>>>>>>>> check (whether converted or not) could perhaps be used without
>>>>>>>>>>> removing the record converter.
>>>>>>>>>>>
>>>>>>>>>>> We need to discuss everything.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
>>>>>>>>>>>> I am still -1 if this means our graph module can work only on
>>>>>>>>>>>> the sequential file format.
>>>>>>>>>>>> Please note that you can set the record converter to null and
>>>>>>>>>>>> make changes to loadVertices for what you desire here.
>>>>>>>>>>>>
>>>>>>>>>>>> If we came to this design because TextInputFormat is
>>>>>>>>>>>> inefficient, would this work for the Avro or Thrift input
>>>>>>>>>>>> formats?
>>>>>>>>>>>> Please let me know before this is changed; I would like to work
>>>>>>>>>>>> on a separate branch.
>>>>>>>>>>>> You may proceed as you wish.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Suraj
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>>>> I think the 'record converter' should be removed. It's not a
>>>>>>>>>>>>> good idea; moreover, it's unnecessarily complex. To keep the
>>>>>>>>>>>>> vertex input reader, we can move the related classes into the
>>>>>>>>>>>>> common module.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's go with my original plan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm reading our old discussions about the record converter,
>>>>>>>>>>>>>> superstep injection, and the common module:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>>>>>>>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> To clarify the goals and objectives:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Parallel input partitioning is necessary for the
>>>>>>>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>>>>>>>>>>>>>> processing. (It's not a memory issue, or the Disk/Spilling
>>>>>>>>>>>>>> Queue, or HAMA-644; please don't conflate them.)
>>>>>>>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>>>>>>>>>>>>>> level, and it applies to every Hama job, not only to graph
>>>>>>>>>>>>>> jobs.
>>>>>>>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>>>>>>>>>>>>>> input should also be considered.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The current problem is that every input of a graph job has to
>>>>>>>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
>>>>>>>>>>>>>> know.
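And a sketch of the condition check mentioned above, i.e. deciding whether an input still needs a partitioning job at all: inputs that are already range-partitioned by sorted key, as in the NoSQL table case, would skip it. The enum and method are hypothetical, for illustration only:

/**
 * Illustrative sketch only. Inputs that are already range-partitioned
 * (e.g. a NoSQL table sorted by key) skip the partitioning job entirely;
 * with raw-record partition files, loadVertices() always does the parsing.
 */
public class PartitioningDecision {

    enum InputKind { TEXT_FILE, SEQUENCE_FILE, RANGE_PARTITIONED_TABLE }

    /** Already range-partitioned tables need no re-partitioning job. */
    static boolean needsPartitioningJob(InputKind kind) {
        return kind != InputKind.RANGE_PARTITIONED_TABLE;
    }

    public static void main(String[] args) {
        for (InputKind kind : InputKind.values()) {
            System.out.println(kind + " -> needs partitioning job: "
                    + needsPartitioningJob(kind));
        }
    }
}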
--
Best Regards, Edward J. Yoon
@eddieyoon
