> - Instead of running a separate job, we inject a partitioning superstep
>   before the first superstep of the job. (This has a dependency on the
>   Superstep API.)
> - The partitions, instead of being written to HDFS (which creates a copy
>   of the input files in the HDFS cluster, too costly I believe), should
>   be written to local files and read from there.
> - For graph jobs, we can configure this partitioning superstep with a
>   graph-specific partitioning class that partitions and loads vertices.
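[Editor's note: a minimal sketch of the quoted proposal's second bullet, writing partitions to local files keyed by a partition ID instead of copying input back into HDFS. All class, method, and file names here are illustrative stand-ins, not Hama's actual API.]

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative only: hash-partition raw records into local per-partition
// files, so the input never has to be rewritten into the HDFS cluster.
public class LocalPartitionWriter {

  // Stable, non-negative partition id for a record key.
  public static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Writes each raw record (keyed by its first whitespace-separated token)
  // to a local partition file such as part-0, part-1, ...
  public static void partition(List<String> rawRecords, int numPartitions,
                               Path localDir) throws IOException {
    BufferedWriter[] writers = new BufferedWriter[numPartitions];
    try {
      for (int i = 0; i < numPartitions; i++) {
        writers[i] = Files.newBufferedWriter(localDir.resolve("part-" + i));
      }
      for (String record : rawRecords) {
        String key = record.split("\\s+", 2)[0];
        BufferedWriter w = writers[partitionFor(key, numPartitions)];
        w.write(record);
        w.newLine();
      }
    } finally {
      for (BufferedWriter w : writers) {
        if (w != null) w.close();
      }
    }
  }
}
```

A graph-specific partitioning class, as in the third bullet, would simply swap in a different key-extraction and partition function.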
I believe the above suggestion can be a future improvement task.

> This sure has some dependencies, but it would be a graceful solution and
> can tackle every problem. This is what I want to achieve in the end.
> Please proceed if you have any intermediate ways to reach here faster.

If you understand my plan now, please let me know so that I can start the
work. My patch will change only a few lines. Finally, I think we can now
prepare the integration with the NoSQL table input formats.

On Tue, May 7, 2013 at 2:01 AM, Suraj Menon <[email protected]> wrote:
> I am assuming that the storage of vertices (NoSQL or any other format)
> need not be updated after every iteration.
>
> Based on the above assumption, I have the following suggestions:
>
> - Instead of running a separate job, we inject a partitioning superstep
>   before the first superstep of the job. (This has a dependency on the
>   Superstep API.)
> - The partitions, instead of being written to HDFS (which creates a copy
>   of the input files in the HDFS cluster, too costly I believe), should
>   be written to local files and read from there.
> - For graph jobs, we can configure this partitioning superstep with a
>   graph-specific partitioning class that partitions and loads vertices.
>
> This sure has some dependencies, but it would be a graceful solution and
> can tackle every problem. This is what I want to achieve in the end.
> Please proceed if you have any intermediate ways to reach here faster.
>
> Regards,
> Suraj
>
> On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:
>
>> P.S., BSPJob (with table input) is also the same. It's not only for
>> GraphJob.
>>
>> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]>
>> wrote:
>> > All,
>> >
>> > I've also roughly described the details of the Graph API design [1].
>> > To reduce our misunderstandings (please read the Partitioning and
>> > GraphModuleInternals documents first):
>> >
>> > * In the NoSQL case, there is obviously no need to hash-partition or
>> > rewrite partition files on HDFS. So for these inputs, I think the
>> > vertex structure should be parsed in the
>> > GraphJobRunner.loadVertices() method.
>> >
>> > Here we face two options: 1) The current implementation of
>> > PartitioningRunner writes converted vertices to sequence-format
>> > partition files, and GraphJobRunner reads only VertexWritable
>> > objects. If the input is a table, we may have to skip the
>> > partitioning job and parse the vertex structure in the loadVertices()
>> > method after checking some conditions. 2) PartitioningRunner just
>> > writes raw records to the proper partition files after checking their
>> > partition IDs, and GraphJobRunner.loadVertices() always parses and
>> > loads the vertices.
>> >
>> > I meant that I prefer the latter, and that there is no need to write
>> > VertexWritable files. It is not related to whether graph will support
>> > only the sequence format or not. I hope my explanation is enough!
>> >
>> > 1. http://wiki.apache.org/hama/GraphModuleInternals
>> >
>> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]>
>> > wrote:
>> >> I've described my big picture here:
>> >> http://wiki.apache.org/hama/Partitioning
>> >>
>> >> Please review it and give feedback on whether this is acceptable.
>> >>
>> >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
>> >>> P.S., I think there's a misunderstanding: it doesn't mean that
>> >>> graph will support only the sequence file format. The main question
>> >>> is whether to convert at the partitioning stage or at the
>> >>> loadVertices stage.
>> >>>
>> >>> Sent from my iPhone
>> >>>
>> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
>> >>>
>> >>>> Sure, please go ahead.
>> >>>>
>> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon
>> >>>> <[email protected]> wrote:
>> >>>>
>> >>>>>>> Please let me know before this is changed; I would like to work
>> >>>>>>> on a separate branch.
>> >>>>>
>> >>>>> Personally, I think we have to focus on high-priority tasks, and
>> >>>>> on getting more feedback and contributions from users. So, if
>> >>>>> changes are made, I'll release periodically. If you want to work
>> >>>>> somewhere else, please do. I don't want to wait for your patches.
>> >>>>>
>> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon
>> >>>>> <[email protected]> wrote:
>> >>>>>> To prepare the integration with NoSQLs, of course, a condition
>> >>>>>> check (whether converted or not) could be used without removing
>> >>>>>> the record converter.
>> >>>>>>
>> >>>>>> We need to discuss everything.
>> >>>>>>
>> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon
>> >>>>>> <[email protected]> wrote:
>> >>>>>>> I am still -1 if this means our graph module can work only on
>> >>>>>>> the sequence file format.
>> >>>>>>> Please note that you can set the record converter to null and
>> >>>>>>> make changes to loadVertices for what you desire here.
>> >>>>>>>
>> >>>>>>> If we came to this design because TextInputFormat is
>> >>>>>>> inefficient, would it work for the Avro or Thrift input
>> >>>>>>> formats?
>> >>>>>>> Please let me know before this is changed; I would like to work
>> >>>>>>> on a separate branch.
>> >>>>>>> You may proceed as you wish.
>> >>>>>>>
>> >>>>>>> Regards,
>> >>>>>>> Suraj
>> >>>>>>>
>> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon
>> >>>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> I think the record converter should be removed. It's not a
>> >>>>>>>> good idea; moreover, it's unnecessarily complex. To keep the
>> >>>>>>>> vertex input reader, we can move the related classes into the
>> >>>>>>>> common module.
>> >>>>>>>>
>> >>>>>>>> Let's go with my original plan.
>> >>>>>>>>
>> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon
>> >>>>>>>> <[email protected]> wrote:
>> >>>>>>>>> Hi all,
>> >>>>>>>>>
>> >>>>>>>>> I'm reading our old discussions about the record converter,
>> >>>>>>>>> superstep injection, and the common module:
>> >>>>>>>>>
>> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
>> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
>> >>>>>>>>>
>> >>>>>>>>> To clarify the goals and objectives:
>> >>>>>>>>>
>> >>>>>>>>> 1. Parallel input partitioning is necessary for the
>> >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
>> >>>>>>>>> processing. (It's not a memory issue, a disk/spilling-queue
>> >>>>>>>>> issue, or HAMA-644; please don't conflate these.)
>> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
>> >>>>>>>>> level, and it applies to every Hama job, not only graph jobs.
>> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL
>> >>>>>>>>> inputs should also be considered.
>> >>>>>>>>>
>> >>>>>>>>> The current problem is that every input of a graph job has to
>> >>>>>>>>> be rewritten on HDFS. If you have a good idea, please let me
>> >>>>>>>>> know.
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best Regards, Edward J. Yoon
>> >>>>>>>>> @eddieyoon

--
Best Regards, Edward J. Yoon
@eddieyoon
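[Editor's note: a minimal sketch of the second option discussed in the thread, where PartitioningRunner only moves raw records and loadVertices() always parses them, so no intermediate VertexWritable files are ever written. The record layout (tab-separated vertex id plus a neighbor list) and all names are assumptions for illustration, not Hama's actual classes.]

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a loadVertices()-style pass that parses raw
// partition records directly into in-memory vertex objects.
public class RawVertexLoader {

  // Minimal stand-in for a vertex: an id plus its outgoing edges.
  public static final class Vertex {
    public final String id;
    public final List<String> edges;
    Vertex(String id, List<String> edges) {
      this.id = id;
      this.edges = edges;
    }
  }

  // Parses records of the form "vertexId<TAB>neighbor1 neighbor2 ...".
  public static Map<String, Vertex> loadVertices(List<String> rawRecords) {
    Map<String, Vertex> vertices = new LinkedHashMap<>();
    for (String record : rawRecords) {
      String[] parts = record.split("\t", 2);
      List<String> edges = new ArrayList<>();
      if (parts.length > 1 && !parts[1].isEmpty()) {
        for (String neighbor : parts[1].split("\\s+")) {
          edges.add(neighbor);
        }
      }
      vertices.put(parts[0], new Vertex(parts[0], edges));
    }
    return vertices;
  }
}
```

Under this option, swapping the input source (sequence file, text, or a NoSQL table scan) only changes where the raw records come from; the parsing stays in one place.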
