I am also skeptical, but I want to be sure. The next thing I would do is 
step through with a debugger to see if the query gets altered in any way 
before it’s sent out. Is it possible to step through with pdb when triggering 
via “airflow run”?
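
For reference, a minimal sketch of the kind of inspection meant here. It assumes the task runs in a foreground process attached to a terminal, which is not verified for "airflow run". The names `fingerprint`, `run_with_inspection` and the `execute` callable are hypothetical stand-ins, not part of Airflow or any BigQuery library.

```python
import hashlib
import pdb


def fingerprint(sql):
    """Return a short SHA-256 digest of the exact query text."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()[:12]


def run_with_inspection(sql, execute, debug=False):
    """Log a fingerprint of the query, optionally break, then send it.

    `execute` is a hypothetical callable standing in for whatever
    actually submits the query.
    """
    # Comparing this digest between the Airflow run and the standalone
    # script shows whether the query text differs by even one character.
    print("query fingerprint: %s" % fingerprint(sql))
    if debug:
        # Drops into the debugger; needs an interactive terminal.
        pdb.set_trace()
    return execute(sql)
```

If the two fingerprints match, the query text itself was not altered and the difference has to come from somewhere else (job configuration, destination, timing).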

On 27.09.2017, 22:56, "Chris Riccomini" 
<[email protected]<mailto:[email protected]>> wrote:

I am highly skeptical that it's the library.

On Wed, Sep 27, 2017 at 1:50 PM, Tobias Feldhaus 
<[email protected]<mailto:[email protected]>> wrote:
This was exactly my point. Before I dig deeper I want to build a minimal 
PythonOperator that uses the new library, as I am currently comparing apples 
with oranges (same query, same data, different libraries). It really puzzles 
me, though, how a different library can yield different data (read: some is 
missing) when its job is just to execute a query, not to pull and transform 
the results.
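
A sketch of how the two outputs could be compared once both variants have written their results, assuming the rows have already been fetched into plain Python sequences. `diff_rows` is a hypothetical helper, not something from either client library.

```python
from collections import Counter


def diff_rows(rows_a, rows_b):
    """Return the rows that appear in only one of the two result sets.

    Rows are compared as tuples, and duplicates are counted, so a row
    that occurs twice in one set and once in the other is reported.
    """
    count_a = Counter(tuple(row) for row in rows_a)
    count_b = Counter(tuple(row) for row in rows_b)
    return {
        # Counter subtraction keeps only positive counts.
        "only_in_a": sorted((count_a - count_b).elements(), key=repr),
        "only_in_b": sorted((count_b - count_a).elements(), key=repr),
    }
```

Running the same query through both libraries and feeding the fetched rows into this helper would show exactly which events are missing, rather than only that the counts differ.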


On 27.09.2017, 19:43, "Chris Riccomini" 
<[email protected]<mailto:[email protected]>> wrote:

    Interesting. Just saw:

    https://github.com/google/google-api-python-client

    > This client library is supported but in maintenance mode only. We are
    fixing necessary bugs and adding essential features to ensure this library
    continues to meet your needs for accessing Google APIs. Non-critical issues
    will be closed. Any issue may be reopened if it is causing ongoing problems.

    Looks like we might want to migrate at some point. It'll be a big change.
    <https://github.com/google/google-api-python-client#about>

    On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini <[email protected]<mailto:[email protected]>> wrote:

    > AFAIK, google-api-python-client is not in maintenance mode. In fact, I
    > believe the idiomatic Python library (google-cloud-python) is built on top
    > of google-api-python-client. I have spoken with several Google Cloud PMs
    > who have pointed me at google-api-python-client as the canonical library
    > to use, and the one that receives updates for new products first (before
    > google-cloud-python).
    >
    > On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <[email protected]<mailto:[email protected]>> wrote:
    >
    >> Sounds like a possible solution, however to avoid hitting this problem
    >> I’ve deleted all the tables before rerunning stuff. I think it might have
    >> to do with the library. Airflow uses google-api-python-client which is in
    >> maintenance mode and Google suggests switching to google-cloud-python. I
    >> will write a PythonOperator DAG tomorrow and will check DAG against DAG
    >> then to see if the library could be the problem.
    >>
    >> On 27.09.2017, 19:15, "Chris Riccomini" <[email protected]<mailto:[email protected]>> wrote:
    >>
    >>     Is it possible that you were getting a cache hit with the BQ operator?
    >>
    >>     https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
    >>
    >>     The operator does not currently expose this flag, and I couldn't find
    >>     whether the cache defaults to on or off for insert-job API.
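
To make the flag in question concrete: in the BigQuery v2 REST API, `useQueryCache` is a field under `configuration.query` in the body sent to jobs.insert. The sketch below only builds such a body with the cache explicitly disabled. The helper name and the project, dataset and table identifiers are hypothetical, and whether the Airflow operator can pass the field through is not something this settles.

```python
def build_query_job_body(sql, project_id, dataset_id, table_id,
                         use_cache=False):
    """Build a jobs.insert request body for a query job.

    Hypothetical helper: it sets useQueryCache explicitly so the
    behaviour does not depend on the API default.
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useQueryCache": use_cache,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


# Placeholder identifiers, for illustration only.
body = build_query_job_body(
    "SELECT 1", "my-project", "my_dataset", "events$20170927"
)
```

Rerunning the failing load with the cache switched off this way would rule the cache in or out as the cause.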
    >>
    >>     On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <[email protected]<mailto:[email protected]>> wrote:
    >>
    >>     > I’ve created a table with only the missing value in the exact same
    >>     > partition, and then it goes through. Could it be that the volume of
    >>     > the data plays a role, or maybe the client libraries?
    >>     >
    >>     > On 27.09.2017, 17:46, "Tobias Feldhaus" <[email protected]<mailto:[email protected]>> wrote:
    >>     >
    >>     >     Hi,
    >>     >
    >>     >
    >>     >     I am tracing a bug in one of our data pipelines and I narrowed it
    >>     > down to a small number of events not being in a table (using Airflow
    >>     > 1.8.2).
    >>     >     After interactively running the query that Airflow executed, I
    >>     > saw the missing entry. When Airflow executed the same query and wrote
    >>     > the results to a partitioned table in BQ, the entry was missing in
    >>     > that destination table.
    >>     >     I’ve tried different scenarios several times now, and the only
    >>     > explanation I can come up with is that using partitioned tables
    >>     > _might_ not be fully supported by Airflow, or that there is some
    >>     > weird bug in the bigquery-python implementation.
    >>     >
    >>     >     When deleting the table, recreating it and reloading the complete
    >>     > date with Airflow, the data is still missing. When reloading a single
    >>     > day, it is also missing. I’ve created a Python script to execute the
    >>     > exact same query and it works as expected.
    >>     >
    >>     >     Any advice on how to track this down further? Is this a known issue?
    >>     >
    >>     >     Best,
    >>     >     Tobias
    >>     >
    >>     >
    >>     >
    >>     >
    >>     >
    >>
    >>
    >>
    >

