On Sun, Mar 26, 2017 at 10:42 PM, Wes Turner <wes.tur...@gmail.com> wrote:
> On Sun, Mar 26, 2017 at 10:02 AM, Nick Coghlan <ncogh...@gmail.com> wrote:
>> On 26 March 2017 at 21:40, Pavel Velikhov <pavel.velik...@gmail.com> wrote:
>> > On 25 Mar 2017, at 19:40, Nick Coghlan <ncogh...@gmail.com> wrote:
>> >> Right, the target audience here *isn't* folks who already know how to
>> >> construct their own relational queries in SQL, and it definitely isn't
>> >> folks that know how to tweak their queries to get optimal performance
>> >> from the specific database they're using. Rather, it's folks that
>> >> already know Python's comprehensions, and perhaps some of the
>> >> itertools features, and helping to provide them with a smoother
>> >> on-ramp into the world of relational data processing.
>> >
>> > Actually I myself am a user of PythonQL, even though I’m an SQL expert. I work in data science, so
>> > I do a lot of ad-hoc querying and we always get some new datasets we need to check out and work with.
>> > Some things like nested data models are also much better handled by PythonQL, and data like
>> > JSON or XML will also be easier to handle.
>>
>> So perhaps a better way of framing it would be to say that PythonQL
>> aims to provide a middle ground between interfaces that are fully in
>> "Python mode" (e.g. ORMs, pandas DataFrames), where the primary
>> interface is methods-on-objects, and those that are fully in "data
>> manipulation mode" (e.g. raw SQL, lower level XML and JSON APIs).
>>
>> At the Python level, success for PythonQL would look like people being
>> able to seamlessly transfer their data manipulation skills from a
>> Django ORM project to an SQL Alchemy project to a pandas analysis
>> project to a distributed data analysis project in dask, without their
>> data manipulation code really having to change - only the backing data
>> structures and the runtime performance characteristics would differ.
>
> e.g. Django ORM to SQLAlchemy:
>
> - Does this necessarily imply a metamodel for relations?
>   - Django: GenericForeignKey
>   - SQLAlchemy: sqlalchemy_utils.generic_relationship
>
> Does this necessarily imply a metamodel for relations? Edges are
> expressed differently in different frameworks; ultimately you're looking
> at a projection of a graph (a constructed subset of a graph). So solving
> this in the general case implies solving for graphs, which includes
> tree-based hierarchical data like SQL, arrays, documents, and keys and
> values.
>
> 1. Schema ("metamodel")
> 2. Query language
>
> Q: How can Linked Data help define a metamodel for expressing relations
> (in order to harmonize search of disparate datasets)?
>
> - It's a graph with schema constraints.
> - Use URIs for Classes ("types"), Properties ("columns", "attributes"),
>   and instances with @ids ("rows")
> - RDF, RDFS, OWL
> - Search n databases asynchronously with SPARQL federation
>   - Native SPARQL database (adapt the data)
>   - SPARQL facade/frontend (adapt to an interface)
> - Define/generate a schema representation for arbitrary data sources
>   which {query language} {implementation A} can use to plan data-local
>   queries and transformations
>   - JSONLD @context for data sources
>
> ### Property
>
> Relations are expressed as properties of class instances.
>
> - rdf:Property
> - schema:Property https://meta.schema.org/Property
>   - https://meta.schema.org/inverseOf
> - owl:inverseOf https://www.w3.org/TR/owl-ref/#inverseOf-def
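To make the "URIs for Classes, Properties, and instances" point concrete, here's a minimal rdflib sketch; the ex: vocabulary and the alice/bob instances are made up for illustration, and the SPARQL query at the end runs locally (federation would add SERVICE clauses against remote endpoints):

```
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF

EX = Namespace("http://example.org/")   # hypothetical vocabulary
g = Graph()

# Properties ("columns"/"attributes") are URI-identified resources themselves:
g.add((EX.parentOf, RDF.type, RDF.Property))
g.add((EX.childOf, OWL.inverseOf, EX.parentOf))   # the owl:inverseOf relation above

# Instances ("rows") get @ids too:
g.add((EX.alice, RDF.type, EX.Person))
g.add((EX.alice, EX.parentOf, EX.bob))

# The same graph can then be queried with SPARQL:
for row in g.query("""
    SELECT ?child
    WHERE { <http://example.org/alice> <http://example.org/parentOf> ?child }
"""):
    print(row.child)
```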
> Q: "How can you provide documentation about the columns in a CSV file?"
>    https://www.w3.org/TR/tabular-data-primer/#documentation-columns
> A: CSVW as [JSONLD,]
> A: https://wrdrd.com/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
>
> - How do we make these work with various stores?
> - How do we include columnar metadata like physical units and precision
>   in databases without support for it?
>   - JSON-LD manifest?
>
> AFAIU, these don't handle relations:
>
> - http://datashape.pydata.org/
> - https://github.com/apache/arrow/blob/master/format/Metadata.md
>
> Q: "How can you describe the schema for multi-dimensional datasets (with complex relations)?"
> A: https://www.w3.org/TR/vocab-data-cube/#data-cubes
>    The relations are otherwise defined as RDFS/OWL (e.g. as JSON-LD).
>
> ## Graph queries
>
> ### SPARQL
>
> - SPARQL is a W3C Web Standard query language.
> - SPARQL is not the only graph query language.
>
> ### Blueprints, Gremlin
>
> Blueprints is a graph traversal/query API.
>
> - There are many Blueprints API implementations (e.g. Rexster,
>   Neo4j <https://wrdrd.com/docs/consulting/knowledge-engineering#neo4j>,
>   Blazegraph <https://wrdrd.com/docs/consulting/knowledge-engineering#blazegraph>,
>   Accumulo <https://wrdrd.com/docs/consulting/knowledge-engineering#accumulo>)
>
> Gremlin implements the Blueprints API (also in Python); it's also generic
> like LINQ (like JDBC for graph databases):
> https://tinkerpop.apache.org/docs/current/reference/#gremlin-python
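In case it helps ground the Gremlin point, a minimal gremlin-python traversal sketch; this assumes a Gremlin Server listening on localhost:8182 with TinkerPop's sample "modern" graph loaded (the 'name'/'knows' labels come from that sample, not from anything above):

```
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Connect a traversal source 'g' to a remote Gremlin Server
graph = Graph()
g = graph.traversal().withRemote(
    DriverRemoteConnection('ws://localhost:8182/gremlin', 'g'))

# "Who does marko know?" -- the same traversal runs unchanged against any
# TinkerPop-enabled backend (Neo4j, Blazegraph, Accumulo via Rexster, ...)
names = g.V().has('name', 'marko').out('knows').values('name').toList()
print(names)
```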
> ### GraphQL
>
> https://github.com/graphql-python/
>
> ... supporting relations across ORMs would be cool; with enough
> abstraction IDK why it wouldn't look like RDFS/OWL.
>
>> ...
>>
>> At the data manipulation layer, success for PythonQL would look like
>> people being able to easily get "good enough" performance for one-off
>> scripts, regardless of the backing data store, with closer attention
>> to detail only being needed for genuinely large data sets (where
>> efficiency matters even for one-off analyses), or for frequently
>> repeated operations (where wasted CPU hours show up as increased
>> infrastructure expenses).
>
> http://pandas.pydata.org/pandas-docs/stable/ecosystem.html#out-of-core
> (dask, blaze, odo, )
>
> http://blaze.pydata.org/
>
> - blaze
>   - | Src: https://github.com/blaze/blaze
>   - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-pandas.html
>   - | Docs: https://blaze.readthedocs.io/en/latest/rosetta-sql.html
>   - | Docs: https://blaze.readthedocs.io/en/latest/backends.html
>     - "Python, Pandas, SQLAlchemy, MongoDB, PyTables, and Spark"
>
> *URIs*
> https://blaze.readthedocs.io/en/latest/uri.html#what-sorts-of-uris-does-blaze-support
>
> ```
> What sorts of URIs does Blaze support?
>
> Paths to files on disk, including the following extensions:
>
> - .csv
> - .json
> - .csv.gz/json.gz
> - .hdf5 (uses h5py)
> - .hdf5::/datapath
> - hdfstore://filename.hdf5 (uses special pandas.HDFStore format)
> - .bcolz
> - .xls(x)
>
> SQLAlchemy strings like the following:
>
> - sqlite:////absolute/path/to/myfile.db::tablename
> - sqlite:////absolute/path/to/myfile.db (specify a particular table)
> - postgresql://username:password@hostname:port
> - impala://hostname (uses impyla)
> - anything supported by SQLAlchemy
>
> MongoDB connection strings of the following form:
>
> - mongodb://username:password@hostname:port/database_name::collection_name
>
> Blaze server strings of the following form:
>
> - blaze://hostname:port (port defaults to 6363)
>
> In all cases when a location or table name is required in addition to
> the traditional URI (e.g. a data path within an HDF5 file or a
> Table/Collection name within a database), *then that information follows
> on the end of the URI after a separator of two colons ::*.
>
> How it works
>
> Blaze depends on the Odo library to handle URIs. URIs are managed
> through the resource function, which is dispatched based on regular
> expressions. For example, a simple resource function to handle .json
> files might look like the following (although Blaze’s actual solution
> is a bit more comprehensive):
>
>     from blaze import resource
>     import json
>
>     @resource.register(r'.+\.json')
>     def resource_json(uri):
>         with open(uri) as f:
>             data = json.load(f)
>         return data
>
> Can I extend this to my own types?
>
> Absolutely. Import and extend *resource* as shown in the "How it works"
> section. The rest of Blaze will pick up your change automatically.
> ```
>
> - dask
>   - | Src: https://github.com/dask/dask
>   - | Docs: http://dask.pydata.org/en/latest/#familiar-user-interface **
>   - | Docs: http://dask.pydata.org/en/latest/scheduler-choice.html
>   - http://xarray.pydata.org/en/stable/dask.html
>
> - odo
>   - | Src: https://github.com/blaze/blaze#odo
>   - | Docs: https://odo.readthedocs.io/en/latest/#formats
>
> - zero-copy
>   - https://www.slideshare.net/wesm/memory-interoperability-in-analytics-and-machine-learning
>   - https://github.com/alex/zero_buffer/blob/master/zero_buffer.py
>
> - ibis
>   - | Src: https://github.com/cloudera/ibis
>     - (a backend may compile to LLVM)
>     - seeAlso: blaze, dask, "bulk synchronous parallel"
>   - | Docs: http://docs.ibis-project.org/sql.html
>   - | Docs: http://docs.ibis-project.org/tutorial.html "Expression tutorial" jupyter notebooks
>   - | Docs: http://docs.ibis-project.org/
>     - Apache Impala (incubating)
>     - Apache Kudu (incubating)
>     - Hadoop Distributed File System (HDFS)
>     - PostgreSQL (Experimental)
>     - SQLite
>   - [ SQLAlchemy: { ... } ]
>     - | Src: https://github.com/cloudera/ibis/blob/master/ibis/sql/alchemy.py
>
> - apache beam
>   - https://beam.apache.org/documentation/sdks/python/
>   - https://beam.apache.org/get-started/quickstart-py/ (pip install apache-beam)
>   - https://beam.apache.org/documentation/sdks/pydoc/0.6.0/
>     - apache_beam.transforms
>   - https://beam.apache.org/documentation/programming-guide/#transforms "Applying transforms"
>
> Somewhere in this list, these become big data tools.
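To make the odo/URI point concrete, a hedged sketch of what those ::-suffixed URIs buy you in practice (the CSV file, database path, and table name here are made up):

```
import pandas as pd
from odo import odo

# CSV -> SQLite table, addressed with the "two colons" convention quoted above
odo('accounts.csv', 'sqlite:///accounts.db::accounts')

# ...and back out into an in-memory pandas DataFrame
df = odo('sqlite:///accounts.db::accounts', pd.DataFrame)
print(df.head())
```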
>> >> There's no question that folks dealing with sufficiently large data
>> >> sets with sufficiently stringent performance requirements are
>> >> eventually going to want to reach for handcrafted SQL or a distributed
>> >> computation framework like dask, but that's not really any different
>> >> from our standard position that when folks are attempting to optimise
>> >> a hot loop, they're eventually going to have to switch to something
>> >> that can eliminate the interpreter's default runtime object management
>> >> overhead (whether that's Cython, PyPy's or Numba's JIT, or writing an
>> >> extension module in a different language entirely). It isn't an
>> >> argument against making it easier for folks to postpone the point
>> >> where they find it necessary to reach for the "something else" that
>> >> takes them beyond Python's default capabilities.
>> >
>> > Don’t know, for example one of the wrappers is going to be an Apache Spark
>> > wrapper, so you could quickly hack up a PythonQL query that would be run
>> > on a distributed platform.
>>
>> Right, I meant this in the same sense that folks using an ORM like SQL
>> Alchemy may eventually hit a point where rather than trying to
>> convince the ORM to emit the SQL they want to run, it's easier to just
>> bypass the ORM layer and write the exact SQL they want.
>
> At that point one can either:
>
> - reflect the tables/mappings at devtime
> - reflect the tables/mappings at runtime
>
> And then run the raw DBAPI query
> (using appropriate query interpolation
> (-> i-strings and scoped configuration state)):
>
>     session.execute("SELECT dbapi_version FROM ?", "tbl;name")
>
>> It's worthwhile attempting to reduce the number of cases where folks
>> feel obliged to do that, but at the same time, abstraction layers need
>> to hide at least some lower level details if they're going to actually
>> work properly.
>>
>> >> = Option 1 =
>> >>
>> >> Fully commit to the model of allowing alternate syntactic dialects to
>> >> run atop Python interpreters. In Hylang and PythonQL we have at least
>> >> two genuinely interesting examples of that working through the text
>> >> encoding system, as well as other examples like Cython that work
>> >> through the extension module system.
>> >>
>> >> So that's an opportunity to take this from "Possible, but a bit hacky"
>> >> to "Pluggable source code translation is supported at all levels of
>> >> the interpreter, including debugger source maps, etc" (perhaps by
>> >> borrowing ideas from other ecosystems like Java, JavaScript, and .NET,
>> >> where this kind of thing is already a lot more common).
>> >>
>> >> The downside of this approach is that actually making it happen would
>> >> be getting pretty far afield from the original PythonQL goal of
>> >> "provide nicer data manipulation abstractions in Python", and it
>> >> wouldn't actually deliver anything new that can't already be done with
>> >> existing import and codec system features.
>> >
>> > This would be great anyways, if we could rely on some preprocessor directive,
>> > instead of hacking encodings, this could be nice.
>>
>> Victor Stinner wrote up some ideas about that in PEP 511:
>> https://www.python.org/dev/peps/pep-0511/
>>
>> Preprocessing is one of the specific use cases considered:
>> https://www.python.org/dev/peps/pep-0511/#usage-2-preprocessor
>>
>> >> = Option 2 =
>> >>
>> >> ... given optionally delayed
>> >> rendering of interpolated strings, PythonQL could be used in the form:
>> >>
>> >>     result = pyql(i"""
>> >>         (x,y)
>> >>         for x in {range(1,8)}
>> >>         for y in {range(1,7)}
>> >>         if x % 2 == 0 and
>> >>            y % 2 != 0 and
>> >>            x > y
>> >>     """)
>> >>
>> >> I personally like this idea (otherwise I wouldn't have written PEP 501
>> >> in the first place), and the necessary technical underpinnings to
>> >> enable it are all largely already in place to support f-strings. If
>> >> the PEP were revised to show examples of using it to support
>> >> relatively seamless calling back and forth between Hylang, PythonQL
>> >> and regular Python code in the same process, that might be intriguing
>> >> enough to pique Guido's interest (and I'm open to adding co-authors
>> >> that are interested in pursuing that).
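Just as a point of reference, that particular example is already expressible as a plain comprehension today; the i-string/pyql form is about capturing the embedded expressions for delayed rendering rather than evaluating them eagerly:

```
# The same query as an ordinary list comprehension (no i-strings needed):
result = [(x, y)
          for x in range(1, 8)
          for y in range(1, 7)
          if x % 2 == 0 and y % 2 != 0 and x > y]
print(result)   # [(2, 1), (4, 1), (4, 3), (6, 1), (6, 3), (6, 5)]
```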
>> > What would be the difference between this and just executing a PythonQL
>> > string for us, getting local and global variables into PythonQL scope?
>>
>> The big new technical capability that f-strings introduced is that the
>> compiler can see the variable references in the embedded expressions,
>> so f-strings "just work" with closure references, whereas passing
>> locals() and globals() explicitly is:
>>
>> 1. slow (since you have to generate a full locals dict);
>> 2. incompatible with the use of closure variables (since they're not
>>    visible in either locals() *or* globals())
>>
>> The i-strings concept takes that closure-compatible interpolation
>> capability and separates it from the str.format based rendering step.
>>
>> From a speed perspective, the interpolation aspects of this approach
>> are so efficient they rival simple string concatenation:
>>
>>     $ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'first + second'
>>     .....................
>>     Mean +- std dev: 71.7 ns +- 2.1 ns
>>
>>     $ python -m perf timeit -s 'first = "Hello"; second = " World!"' 'f"{first}{second}"'
>>     .....................
>>     Mean +- std dev: 77.8 ns +- 2.5 ns
>>
>> Something like pyql that did more than just concatenate the text
>> sections with the text values of the embedded expressions would still
>> need some form of regex-style caching strategy to avoid parsing the
>> same query string multiple times, but the Python interpreter would
>> handle the task of breaking up the string into the text sections and
>> the interpolated Python expressions.
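On the "regex-style caching strategy" point, a rough sketch of what that might look like; parse_pyql here is a hypothetical stand-in for a real PythonQL template parser, not an existing API:

```
import functools

@functools.lru_cache(maxsize=128)
def parse_pyql(template_text):
    # Hypothetical parser: a real implementation would build a query plan
    # from the literal text sections; here we just record the split.
    return tuple(template_text.split())

def pyql(template_text, bindings):
    plan = parse_pyql(template_text)   # cache hit on every call after the first,
                                       # much as the re module caches compiled patterns
    # ...evaluate `plan` against `bindings` (elided)...
    return plan
```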
>>
>> Cheers,
>> Nick.
>>
>> --
>> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
>> _______________________________________________
>> Python-ideas mailing list
>> Python-ideas@python.org
>> https://mail.python.org/mailman/listinfo/python-ideas
>> Code of Conduct: http://python.org/psf/codeofconduct/

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/