Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks
Thank you Reza. That separation makes sense to me.

On Wed, May 29, 2019 at 6:26 PM Reza Rokni wrote:

> +1
>
> I think there will be at least two layers of this:
>
> Layer 1 - Using primitives: I do join, GBK, Aggregation... with system x
> this way; what is the canonical equivalent in Beam?
> Layer 2 - Patterns: I read and join Unbounded and Bounded Data in system
> x this way; what is the canonical equivalent in Beam?
>
> I suspect that, as a first pass, Layer 1 is reasonably well-bounded work. There
> would need to be agreement on the "canonical" version of how to do something in
> Beam, as this could be seen as opinionated, since there are often a
> multitude of ways of doing x.

Once we identify a set of Layer 1 items, we could crowdsource the canonical implementations. I believe we can use our usual code review process to settle on a version that is agreeable. (Examples have the same issue; they are probably opinionated today based on the author, but it works out.)

> On Thu, 30 May 2019 at 08:56, Ahmet Altay wrote:
>
>> Hi all,
>>
>> Inspired by the user asking about a Spark feature in Beam [1] in the
>> release thread, I searched the user@ list and noticed a few instances of
>> people asking questions like "I can do X in Spark, how can I do that in
>> Beam?" Would it make sense to add documentation that explains how certain
>> tasks can be accomplished in Beam, with side-by-side examples of doing
>> the same task in Beam/Spark etc.? It could help with onboarding because it
>> will be easier for people to leverage their existing knowledge. It could
>> also help other frameworks, because it will serve as a Rosetta
>> stone with two translations.
>>
>> Questions I have are:
>> - Would such a thing be helpful?
>> - Is it feasible? Would a few pages' worth of examples cover enough
>> use cases?
>>
>> Thank you!
>> Ahmet
>>
>> [1]
>> https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
Re: 1 Million Lines of Code (1 MLOC)
Interesting, so if we play with https://github.com/cgag/loc we could break it down further? I.e., test files vs. code files, which folders, etc. That could be interesting as well.

On Fri, May 31, 2019 at 4:20 PM Brian Hulette wrote:

> Dennis Nedry needed 2 million lines of code to control Jurassic Park, and
> he only had to manage eight computers! I think we may actually need to pick
> up the pace.
>
> On Fri, May 31, 2019 at 4:11 PM Anton Kedin wrote:
>
>> And to reduce the effort of future rewrites we should start doing it on a
>> schedule. I propose we start over once a week :)
>>
>> On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik wrote:
>>
>>> 1 million lines is too much, time to delete the entire project and start
>>> over again, :-)
>>>
>>> On Fri, May 31, 2019 at 3:12 PM Ankur Goenka wrote:
>>>
>>>> Thanks for sharing.
>>>> These are really interesting metrics.
>>>> One use I can see is to track LOC vs Comments to make sure that we keep
>>>> up with the practice of writing maintainable code.
>>>>
>>>> On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía wrote:
>>>>
>>>>> I was checking some metrics in our codebase and found by chance that
>>>>> we have passed the 1 million lines of code (MLOC). Of course lines of
>>>>> code may not matter much but anyway it is interesting to see the size
>>>>> of our project at this moment.
>>>>>
>>>>> This is the detailed information returned by loc [1]:
>>>>>
>>>>> Language           Files     Lines     Blank   Comment      Code
>>>>>
>>>>> Java                3681    673007     78265    140753    453989
>>>>> Python               497    131082     22560     13378     95144
>>>>> Go                   333    105775     13681     11073     81021
>>>>> Markdown             205     31989      6526         0     25463
>>>>> Plain Text            11     21979      6359         0     15620
>>>>> Sass                  92      9867      1434      1900      6533
>>>>> JavaScript            19      5157      1197       467      3493
>>>>> YAML                  14      4601       454      1104      3043
>>>>> Bourne Shell          30      3874       470      1028      2376
>>>>> Protobuf              17      4258       677      1373      2208
>>>>> XML                   17      2789       296       559      1934
>>>>> Kotlin                19      3501       347      1370      1784
>>>>> HTML                  60      2447       148       914      1385
>>>>> Batch                  3       249        57         0       192
>>>>> INI                    1       206        21        16       169
>>>>> C++                    2        72         4        36        32
>>>>> Autoconf               1        21         1        16         4
>>>>>
>>>>> Total               5002   1000874    132497    173987    694390
>>>>>
>>>>> [1] https://github.com/cgag/loc
Re: 1 Million Lines of Code (1 MLOC)
Dennis Nedry needed 2 million lines of code to control Jurassic Park, and he only had to manage eight computers! I think we may actually need to pick up the pace.

On Fri, May 31, 2019 at 4:11 PM Anton Kedin wrote:

> And to reduce the effort of future rewrites we should start doing it on a
> schedule. I propose we start over once a week :)
>
> On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik wrote:
>
>> 1 million lines is too much, time to delete the entire project and start
>> over again, :-)
>>
>> On Fri, May 31, 2019 at 3:12 PM Ankur Goenka wrote:
>>
>>> Thanks for sharing.
>>> These are really interesting metrics.
>>> One use I can see is to track LOC vs Comments to make sure that we keep
>>> up with the practice of writing maintainable code.
>>>
>>> On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía wrote:
>>>
>>>> I was checking some metrics in our codebase and found by chance that
>>>> we have passed the 1 million lines of code (MLOC). Of course lines of
>>>> code may not matter much but anyway it is interesting to see the size
>>>> of our project at this moment.
Re: 1 Million Lines of Code (1 MLOC)
And to reduce the effort of future rewrites we should start doing it on a schedule. I propose we start over once a week :)

On Fri, May 31, 2019 at 4:02 PM Lukasz Cwik wrote:

> 1 million lines is too much, time to delete the entire project and start
> over again, :-)
>
> On Fri, May 31, 2019 at 3:12 PM Ankur Goenka wrote:
>
>> Thanks for sharing.
>> These are really interesting metrics.
>> One use I can see is to track LOC vs Comments to make sure that we keep
>> up with the practice of writing maintainable code.
>>
>> On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía wrote:
>>
>>> I was checking some metrics in our codebase and found by chance that
>>> we have passed the 1 million lines of code (MLOC). Of course lines of
>>> code may not matter much but anyway it is interesting to see the size
>>> of our project at this moment.
Re: 1 Million Lines of Code (1 MLOC)
1 million lines is too much, time to delete the entire project and start over again, :-)

On Fri, May 31, 2019 at 3:12 PM Ankur Goenka wrote:

> Thanks for sharing.
> These are really interesting metrics.
> One use I can see is to track LOC vs Comments to make sure that we keep up
> with the practice of writing maintainable code.
>
> On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía wrote:
>
>> I was checking some metrics in our codebase and found by chance that
>> we have passed the 1 million lines of code (MLOC). Of course lines of
>> code may not matter much but anyway it is interesting to see the size
>> of our project at this moment.
Design Proposal for Cost Estimation
Dear Members of the Apache Beam Dev List,

My name is Alireza; I am a Software Engineer Intern at Google, and I am working closely with Anton on the Beam SQL query optimizer. Currently, it uses Apache Calcite without any cost estimation; I am proposing to implement a cost estimator for it. The first step would be implementing a cost estimator for the sources; this is my design proposal for that implementation. I would appreciate your comments and suggestions.

https://docs.google.com/document/d/1vi1PBBu5IqSy-qZl1Gk-49CcANOpbNs1UAud6LnOaiY/edit#heading=h.6rlkpwwx7gvf

Best,
Alireza Samadian
Re: [DISCUSS] Portability representation of schemas
> Can you propose what the protos would look like in this case? Right now
> LogicalType does not contain the to/from conversion functions in the proto.
> Do you think we'll need to add these in?

Maybe. Right now the proposed LogicalType message is pretty simple/generic:

    message LogicalType {
      FieldType representation = 1;
      string logical_urn = 2;
      bytes logical_payload = 3;
    }

If we keep just logical_urn and logical_payload, the logical_payload could itself be a protobuf with attributes of 1) a serialized class and 2/3) to/from functions.

Or, alternatively, we could have a generalization of the SchemaRegistry for logical types. Implementations for standard types and user-defined types would be registered by URN, and the SDK could look them up given just a URN. I put a brief section about this alternative in the doc last week [1]. What I suggested there included removing the logical_payload field, which is probably overkill. The critical piece is just relying on a registry in the SDK to look up types and to/from functions rather than storing them in the portable schema itself.

I kind of like keeping the LogicalType message generic for now, since it gives us a way to try out these various approaches, but maybe that's just a cop-out.

[1] https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit?ts=5cdf6a5b#heading=h.jlt5hdrolfy

On Fri, May 31, 2019 at 12:36 PM Reuven Lax wrote:

> On Tue, May 28, 2019 at 10:11 AM Brian Hulette wrote:
>
>> On Sun, May 26, 2019 at 1:25 PM Reuven Lax wrote:
>>
>>> On Fri, May 24, 2019 at 11:42 AM Brian Hulette wrote:
>>>
>>>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
>>>> and we should think about that.
>>>>
>>>> I'm a little concerned that the current proposals for a portable
>>>> representation don't actually fully represent Schemas. It seems to me
>>>> that the current Java-only Schemas are made up of three concepts that
>>>> are intertwined:
>>>> (a) The Java SDK specific code for schema inference, type coercion, and
>>>> "schema-aware" transforms.
>>>> (b) A RowCoder[1] that encodes Rows[2] which have a particular Schema[3].
>>>> (c) A SchemaCoder[4] that has a RowCoder for a particular schema, and
>>>> functions for converting Rows with that schema to/from a Java type T.
>>>> Those functions and the RowCoder are then composed to provide a Coder
>>>> for the type T.
>>>
>>> RowCoder is currently just an internal implementation detail; it can be
>>> eliminated. SchemaCoder is the only thing that determines a schema today.
>>
>> Why not keep it around? I think it would make sense to have a RowCoder
>> implementation in every SDK, as well as something like SchemaCoder that
>> defines a conversion from that SDK's "Row" to the language type.
>
> The point is that from a programmer's perspective, there is nothing much
> special about Row. Any type can have a schema, and the only special thing
> about Row is that it's always guaranteed to exist. From that standpoint,
> Row is nearly an implementation detail. Today RowCoder is never set on
> _any_ PCollection; it's literally just used as a helper library, so there's
> no real need for it to exist as a "Coder."
>
>>>> We're not concerned with (a) at this time since that's specific to the
>>>> SDK, not the interface between them. My understanding is we just want to
>>>> define a portable representation for (b) and/or (c).
>>>>
>>>> What has been discussed so far is really just a portable representation
>>>> for (b), the RowCoder, since the discussion is only around how to
>>>> represent the schema itself and not the to/from functions.
>>>
>>> Correct. The to/from functions are actually related to (a). One of the
>>> big goals of schemas was that users should not be forced to operate on rows
>>> to get schemas. A user can create PCollection<MyRandomType> and as long as
>>> the SDK can infer a schema from MyRandomType, the user never needs to even
>>> see a Row object. The to/fromRow functions are what make this work today.
>>
>> One of the points I'd like to make is that this type coercion is a useful
>> concept on its own, separate from schemas. It's especially useful for a
>> type that has a schema and is encoded by RowCoder since that can represent
>> many more types, but the type coercion doesn't have to be tied to just
>> schemas and RowCoder. We could also do type coercion for types that are
>> effectively wrappers around an integer or a string. It could just be a
>> general way to map language types to base types (i.e. types that we have a
>> coder for). Then it just becomes a general framework for extending coders
>> to represent more language types.
>
> Let's not tie those conversations. Maybe a similar concept will hold true
> for general coders (or we might decide to get rid of coders in favor of
> schemas, in which case that becomes moot), but I don't think we should
> prematurely generalize.
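To make the registry alternative from the reply above concrete, here is a minimal Python sketch. It is not taken from Beam or from the linked design doc. It assumes a registry keyed by URN whose entries hold the base representation and the to/from functions; `LogicalTypeRegistry`, `LogicalTypeEntry`, the URN string, and the millisecond-timestamp type are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class LogicalTypeEntry:
    """Hypothetical registry entry: base representation plus to/from functions."""
    representation: str                # name of the base type, e.g. "INT64"
    to_base: Callable[[Any], Any]      # language type -> base type
    from_base: Callable[[Any], Any]    # base type -> language type


class LogicalTypeRegistry:
    """Hypothetical SDK-side registry that resolves logical types by URN."""

    def __init__(self) -> None:
        self._entries: Dict[str, LogicalTypeEntry] = {}

    def register(self, urn: str, entry: LogicalTypeEntry) -> None:
        if urn in self._entries:
            raise ValueError(f"URN already registered: {urn}")
        self._entries[urn] = entry

    def lookup(self, urn: str) -> LogicalTypeEntry:
        try:
            return self._entries[urn]
        except KeyError:
            raise KeyError(f"No logical type registered for URN: {urn}") from None


# Made-up URN for illustration; a real one would be agreed on by the SDKs.
MILLIS_URN = "example:logical_type:millis_instant:v1"

REGISTRY = LogicalTypeRegistry()
REGISTRY.register(
    MILLIS_URN,
    LogicalTypeEntry(
        representation="INT64",
        # Whole milliseconds since the epoch, using exact timedelta arithmetic.
        to_base=lambda dt: (dt - EPOCH) // timedelta(milliseconds=1),
        from_base=lambda ms: EPOCH + timedelta(milliseconds=ms),
    ),
)

# Only the URN travels with the schema; the functions are looked up locally.
entry = REGISTRY.lookup(MILLIS_URN)
sent = datetime(2019, 5, 31, 12, 36, tzinfo=timezone.utc)
encoded = entry.to_base(sent)
decoded = entry.from_base(encoded)
print(encoded, decoded == sent)
```

Under these assumptions the portable schema would carry only the URN and the representation, and each SDK would resolve the conversion functions from its own registry. Whether a payload field is still needed is exactly the open question in the thread.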
Re: 1 Million Lines of Code (1 MLOC)
Thanks for sharing.
These are really interesting metrics.
One use I can see is to track LOC vs Comments to make sure that we keep up with the practice of writing maintainable code.

On Fri, May 31, 2019 at 3:04 PM Ismaël Mejía wrote:

> I was checking some metrics in our codebase and found by chance that
> we have passed the 1 million lines of code (MLOC). Of course lines of
> code may not matter much but anyway it is interesting to see the size
> of our project at this moment.
1 Million Lines of Code (1 MLOC)
I was checking some metrics in our codebase and found by chance that we have passed the 1 million lines of code (MLOC). Of course lines of code may not matter much but anyway it is interesting to see the size of our project at this moment.

This is the detailed information returned by loc [1]:

Language           Files     Lines     Blank   Comment      Code

Java                3681    673007     78265    140753    453989
Python               497    131082     22560     13378     95144
Go                   333    105775     13681     11073     81021
Markdown             205     31989      6526         0     25463
Plain Text            11     21979      6359         0     15620
Sass                  92      9867      1434      1900      6533
JavaScript            19      5157      1197       467      3493
YAML                  14      4601       454      1104      3043
Bourne Shell          30      3874       470      1028      2376
Protobuf              17      4258       677      1373      2208
XML                   17      2789       296       559      1934
Kotlin                19      3501       347      1370      1784
HTML                  60      2447       148       914      1385
Batch                  3       249        57         0       192
INI                    1       206        21        16       169
C++                    2        72         4        36        32
Autoconf               1        21         1        16         4

Total               5002   1000874    132497    173987    694390

[1] https://github.com/cgag/loc
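Following the suggestion elsewhere in this thread to track LOC vs. comments, a comment-to-code ratio can be computed directly from the loc figures. The sketch below is illustrative only. It copies the three largest rows and the Total row from the loc output above, and it assumes the columns are Files, Lines, Blank, Comment, Code (a split that is consistent with Lines = Blank + Comment + Code in every row). The helper names `comment_ratio` and `is_consistent` are hypothetical.

```python
# (files, lines, blank, comment, code), copied from the loc output above.
LOC = {
    "Java": (3681, 673007, 78265, 140753, 453989),
    "Python": (497, 131082, 22560, 13378, 95144),
    "Go": (333, 105775, 13681, 11073, 81021),
}
TOTAL = (5002, 1000874, 132497, 173987, 694390)


def is_consistent(row):
    """Check that the line count equals blank + comment + code."""
    _files, lines, blank, comment, code = row
    return lines == blank + comment + code


def comment_ratio(row):
    """Return comment lines per line of code for one row."""
    _files, _lines, _blank, comment, code = row
    return comment / code


for language, row in LOC.items():
    assert is_consistent(row), language
    print(f"{language}: {comment_ratio(row):.3f} comment lines per code line")

print(f"All languages: {comment_ratio(TOTAL):.3f} comment lines per code line")
```

Recording these ratios per release, rather than once, is what would show whether commenting keeps pace with the code. That tracking step is left out of the sketch.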
Re: [VOTE] Release 2.13.0, release candidate #2
+1

I validated the Python 2 quickstarts.

On Fri, May 31, 2019 at 10:22 AM Lukasz Cwik wrote:

> I did the Java local quickstart for all the runners in the release
> validation sheet and gearpump failed for me due to a missing dependency.
> Even after I fixed up the dependency, the pipeline then got stuck. I filed
> BEAM-7467 with all the details.
>
> Note that I tried the quickstart for 2.8.0 through 2.12.0:
> 2.8.0 and 2.9.0 failed due to a timeout (maybe I was using the wrong
> command but this test[1] suggests that I was using a correct one).
> 2.10.0 and higher fail due to the missing gs-collections dependency.
>
> Manu, could you help figure out what is going on?
>
> 1:
> https://github.com/apache/beam/blob/2d3bcdc542536037c3e657a8b00ebc222487476b/release/src/main/groovy/quickstart-java-gearpump.groovy#L33
>
> On Thu, May 30, 2019 at 7:53 PM Ankur Goenka wrote:
>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #2 for the version
>> 2.13.0, as follows:
>>
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint
>> 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.13.0-RC2" [5],
>> * website pull request listing the release [6] and publishing the API
>> reference manual [7].
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>> * Validation sheet with a tab for 2.13.0 release to help with validation
>> [8].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Ankur
>>
>> [1]
>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1070/
>> [5] https://github.com/apache/beam/tree/v2.13.0-RC2
>> [6] https://github.com/apache/beam/pull/8645
>> [7] https://github.com/apache/beam-site/pull/589
>> [8]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
Re: [VOTE] Release 2.13.0, release candidate #2
I did the Java local quickstart for all the runners in the release validation sheet and gearpump failed for me due to a missing dependency. Even after I fixed up the dependency, the pipeline then got stuck. I filed BEAM-7467 with all the details.

Note that I tried the quickstart for 2.8.0 through 2.12.0:
2.8.0 and 2.9.0 failed due to a timeout (maybe I was using the wrong command but this test[1] suggests that I was using a correct one).
2.10.0 and higher fail due to the missing gs-collections dependency.

Manu, could you help figure out what is going on?

1: https://github.com/apache/beam/blob/2d3bcdc542536037c3e657a8b00ebc222487476b/release/src/main/groovy/quickstart-java-gearpump.groovy#L33

On Thu, May 30, 2019 at 7:53 PM Ankur Goenka wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #2 for the version 2.13.0,
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint
> 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.13.0-RC2" [5],
> * website pull request listing the release [6] and publishing the API
> reference manual [7].
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.13.0 release to help with validation
> [8].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Ankur
>
> [1]
> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166
> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1070/
> [5] https://github.com/apache/beam/tree/v2.13.0-RC2
> [6] https://github.com/apache/beam/pull/8645
> [7] https://github.com/apache/beam-site/pull/589
> [8]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
Re: Support for PaneInfo in Python SDK
Hi Pablo,

Thanks for that example. It would be great to be able to use the fileio.WriteToFiles transform to write files with filenames that are based on their PaneInfo.

Thanks, @Charles Chen, for adding the remaining work on the issue: the emission of PaneInfo in the Python implementation of GBK (either in the Python FnAPIRunner or the old Python DirectRunner / triggers.py).

I will certainly make sure that [BEAM-3759] is completed after my GSoC project is implemented. It's a good opportunity to get into the runner code.

Regards
- TT

On Fri, May 31, 2019 at 2:35 AM Pablo Estrada wrote:

> Hi Tanay,
> thanks for bringing this to the mailing list. I believe this is certainly
> useful, and necessary. As an example, the fileio.WriteToFiles transform
> does not work well without PaneInfo data (since we can't know how many
> firings there are for each window, and we can't give names to files based
> on this).
>
> Best
> -P.
>
> On Thu, May 30, 2019 at 1:00 PM Tanay Tummalapalli wrote:
>
>> Hi everyone,
>>
>> The PR linked in [BEAM-3759] - "Add support for PaneInfo descriptor in
>> Python SDK"[1] was merged, but the issue is still open.
>> There might be some work left on this for full support for PaneInfo. E.g.,
>> although the PaneInfo class exists, it is not accessible in a DoFn via a
>> kwarg (PaneInfoParam) like TimestampParam or WindowParam.
>>
>> Please let me know the remaining work to be done on this issue as this
>> may be needed in the near future.
>>
>> Regards
>> Tanay Tummalapalli
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-3759
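Pablo's point about naming files per firing can be illustrated without any Beam API. The sketch below assumes each firing of a window carries a zero-based pane index and a last-pane flag. `Pane` and `pane_file_name` are hypothetical stand-ins written for this example; they are not part of the Python SDK or of fileio.WriteToFiles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pane:
    """Hypothetical stand-in for the pane metadata of one trigger firing."""
    index: int     # 0 for the first firing of a window, then 1, 2, ...
    is_last: bool  # True when no further firings are expected for the window


def pane_file_name(prefix, window_end, pane, shard, num_shards):
    """Build a file name that is unique per window, firing, and shard."""
    last_marker = "-last" if pane.is_last else ""
    return (
        f"{prefix}-{window_end}-pane-{pane.index}{last_marker}"
        f"-{shard:05d}-of-{num_shards:05d}"
    )


# Two firings of the same window produce distinct names, so a later
# firing does not overwrite the output of an earlier one.
first = pane_file_name("out", "20190531T0000Z", Pane(0, False), 0, 1)
final = pane_file_name("out", "20190531T0000Z", Pane(1, True), 0, 1)
print(first)
print(final)
```

Without the pane index, both firings would map to the same name for the window, which is the collision the thread describes.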