Re: Issues about Partitioning and Record converter

Suraj Menon Mon, 06 May 2013 10:01:03 -0700

I am assuming that the storage of vertices (NoSQL or any other format) need
not be updated after every iteration.


Based on the above assumption, I have the following suggestions:

- Instead of running a separate job, we inject a partitioning superstep
before the first superstep of the job. (This has a dependency on the
Superstep API)
- Instead of being written to HDFS, which creates a copy of the input files
in the HDFS cluster (too costly, I believe), the partitions should be written
to local files and read from there.
- For graph jobs, we can configure this partitioning superstep with a
graph-specific partitioning class that partitions and loads vertices.

This certainly has some dependencies, but it would be a graceful solution and
can tackle every problem. This is what I want to achieve in the end. Please
proceed if you have any intermediate ways to get there faster.
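
A minimal sketch of the suggestions above, under stated assumptions: it runs
in a single process, it assumes the vertex ID is the first tab-separated field
of a raw record, and every name in it (PartitioningSketch, vertexIdOf,
partitionFor, writeLocalPartitions, loadPartition) is hypothetical and not
part of the Hama API. In a real job the partitioning superstep would send each
record to its owning peer before that peer writes it locally; here that
exchange is replaced by in-memory buckets.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the Hama code base.
class PartitioningSketch {

  // Assumption: the vertex ID is the first tab-separated field of a raw record.
  static String vertexIdOf(String rawRecord) {
    int tab = rawRecord.indexOf('\t');
    return tab < 0 ? rawRecord : rawRecord.substring(0, tab);
  }

  // Hash partitioning: floorMod keeps the result in [0, numPeers)
  // even when hashCode() is negative.
  static int partitionFor(String vertexId, int numPeers) {
    return Math.floorMod(vertexId.hashCode(), numPeers);
  }

  // Stand-in for the injected partitioning superstep: route raw (unparsed)
  // records to their partition and write each partition to a local file,
  // not to HDFS.
  static List<Path> writeLocalPartitions(List<String> rawRecords, int numPeers,
      Path localDir) throws IOException {
    List<List<String>> buckets = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      buckets.add(new ArrayList<>());
    }
    for (String record : rawRecords) {
      buckets.get(partitionFor(vertexIdOf(record), numPeers)).add(record);
    }
    List<Path> files = new ArrayList<>();
    for (int i = 0; i < numPeers; i++) {
      Path file = localDir.resolve("part-" + i);
      Files.write(file, buckets.get(i), StandardCharsets.UTF_8);
      files.add(file);
    }
    return files;
  }

  // Stand-in for the load step: a peer reads back its own local partition
  // and would parse the vertex structure from the raw records here.
  static List<String> loadPartition(Path file) throws IOException {
    return Files.readAllLines(file, StandardCharsets.UTF_8);
  }
}
```

With this shape, the input is read once and the only extra copy is the local
partition file that each peer reads back when loading vertices.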

Regards,
Suraj




On Mon, May 6, 2013 at 3:14 AM, Edward J. Yoon <[email protected]> wrote:

> P.S. The same applies to BSPJob (with table input). It's not only for GraphJob.
>
> On Mon, May 6, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
> > All,
> >
> > I've also roughly described the design of the Graph APIs[1]. To
> > reduce misunderstandings (please read the Partitioning and
> > GraphModuleInternals documents first):
> >
> >  * In the NoSQL case, there's obviously no need to hash-partition or
> > rewrite partition files on HDFS. So, for these inputs, I think the
> > vertex structure should be parsed in the GraphJobRunner.loadVertices()
> > method.
> >
> > Here, we face two options: 1) The current implementation of
> > 'PartitioningRunner' writes converted vertices to sequence-format
> > partition files, and GraphJobRunner reads only Vertex Writable
> > objects. If the input is a table, we may have to skip the Partitioning
> > job and parse the vertex structure in the loadVertices() method after
> > checking some conditions. 2) PartitioningRunner just writes raw
> > records to the proper partition files after checking their partition ID,
> > and GraphJobRunner.loadVertices() always parses and loads vertices.
> >
> > I meant that I prefer the latter, and there's no need to write
> > VertexWritable files. This is unrelated to whether graph will support
> > only the Seq format or not. Hope my explanation is enough!
> >
> > 1. http://wiki.apache.org/hama/GraphModuleInternals
> >
> > On Mon, May 6, 2013 at 10:00 AM, Edward J. Yoon <[email protected]> wrote:
> >> I've described my big picture here: http://wiki.apache.org/hama/Partitioning
> >>
> >> Please review and give feedback on whether this is acceptable.
> >>
> >>
> >> On Mon, May 6, 2013 at 8:18 AM, Edward <[email protected]> wrote:
> >>> P.S. I think there's a misunderstanding. It doesn't mean that graph will
> >>> support only the sequence file format. The main question is whether to
> >>> convert at the partitioning stage or at the loadVertices stage.
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On May 6, 2013, at 8:09 AM, Suraj Menon <[email protected]> wrote:
> >>>
> >>>> Sure, Please go ahead.
> >>>>
> >>>>
> >>>> On Sun, May 5, 2013 at 6:52 PM, Edward J. Yoon <[email protected]> wrote:
> >>>>
> >>>>>>> Please let me know before this is changed, I would like to work on
> >>>>>>> a separate branch.
> >>>>>
> >>>>> Personally, I think we have to focus on high-priority tasks, and on
> >>>>> more feedback and contributions from users. So, if changes are made,
> >>>>> I'll release periodically. If you want to work elsewhere, please do.
> >>>>> I don't want to wait for your patches.
> >>>>>
> >>>>>
> >>>>> On Mon, May 6, 2013 at 7:49 AM, Edward J. Yoon <[email protected]> wrote:
> >>>>>> To prepare for integration with NoSQLs, of course, maybe a condition
> >>>>>> check (whether converted or not) can be used without removing the
> >>>>>> record converter.
> >>>>>>
> >>>>>> We need to discuss everything.
> >>>>>>
> >>>>>> On Mon, May 6, 2013 at 7:11 AM, Suraj Menon <[email protected]> wrote:
> >>>>>>> I am still -1 if this means our graph module can work only on
> >>>>>>> sequential file format.
> >>>>>>> Please note that you can set record converter to null and make
> >>>>>>> changes to loadVertices for what you desire here.
> >>>>>>>
> >>>>>>> If we came to this design, because TextInputFormat is inefficient,
> >>>>>>> would this work for Avro or Thrift input format?
> >>>>>>> Please let me know before this is changed, I would like to work on
> >>>>>>> a separate branch.
> >>>>>>> You may proceed as you wish.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Suraj
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sun, May 5, 2013 at 4:09 PM, Edward J. Yoon <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> I think 'record converter' should be removed. It's not a good idea.
> >>>>>>>> Moreover, it's unnecessarily complex. To keep the vertex input
> >>>>>>>> reader, we can move the related classes into the common module.
> >>>>>>>>
> >>>>>>>> Let's go with my original plan.
> >>>>>>>>
> >>>>>>>> On Sat, May 4, 2013 at 9:32 AM, Edward J. Yoon <[email protected]> wrote:
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> I'm reading our old discussions about record converter, superstep
> >>>>>>>>> injection, and common module:
> >>>>>>>>>
> >>>>>>>>> - http://markmail.org/message/ol32pp2ixfazcxfc
> >>>>>>>>> - http://markmail.org/message/xwtmfdrag34g5xc4
> >>>>>>>>>
> >>>>>>>>> To clarify goals and objectives:
> >>>>>>>>>
> >>>>>>>>> 1. A parallel input partition is necessary for obtaining
> >>>>>>>>> scalability and elasticity of Bulk Synchronous Parallel
> >>>>>>>>> processing (It's not a memory issue, or Disk/Spilling Queue, or
> >>>>>>>>> HAMA-644. Please don't shake).
> >>>>>>>>> 2. Input partitioning should be handled at the BSP framework
> >>>>>>>>> level, and it is for every Hama job, not only for Graph jobs.
> >>>>>>>>> 3. Unnecessary I/O overhead needs to be avoided, and NoSQL input
> >>>>>>>>> should also be considered.
> >>>>>>>>>
> >>>>>>>>> The current problem is that every input of graph jobs has to be
> >>>>>>>>> rewritten on HDFS. If you have a good idea, please let me know.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best Regards, Edward J. Yoon
> >>>>>>>>> @eddieyoon
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best Regards, Edward J. Yoon
> >>>>>>>> @eddieyoon
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Best Regards, Edward J. Yoon
> >>>>>> @eddieyoon
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best Regards, Edward J. Yoon
> >>>>> @eddieyoon
> >>>>>
> >>
> >>
> >>
> >> --
> >> Best Regards, Edward J. Yoon
> >> @eddieyoon
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>
