I think I found the issue. I was rerunning everything again and found that the respective date was now there, but another date was missing. After some investigation I stumbled upon this:
Airflow simply didn't process some days of the month (August) that I was reprocessing. Yesterday it didn't process August 24th, and now August 17th and 18th are missing!

[Screenshot of the Airflow interface showing the run for 2017-08-16; the runs for the 17th and 18th are missing and the 19th is the next one: https://puu.sh/xKRKH/0cc9bc01d6.png]
[Screenshot of the Airflow interface showing the run for 2017-08-19: https://puu.sh/xKRL9/0ac26fb476.png]

What could be the reason for this? Did the clearing command via the web interface maybe fail? Why are the days no longer shown in the web interface at all?

On 27.09.2017, 23:20, "Tobias Feldhaus" <[email protected]> wrote:

I am also skeptical, but I want to be sure - the next thing I would do is step through with a debugger to see if the query gets altered in any way before it's sent out. Is it possible to step through with pdb when triggering via "airflow run"?

On 27.09.2017, 22:56, "Chris Riccomini" <[email protected]> wrote:

I am highly skeptical that it's the library.

On Wed, Sep 27, 2017 at 1:50 PM, Tobias Feldhaus <[email protected]> wrote:

This was exactly my point. Before I dig deeper I want to build a minimal PythonOperator that uses the new library, as I am currently comparing apples with oranges (same query, same data, different libraries). Although it really puzzles me how a different library can yield different data (read as: some is missing), when its job is just to execute a query and not to pull and transform the data.

On 27.09.2017, 19:43, "Chris Riccomini" <[email protected]> wrote:

Interesting. Just saw:

https://github.com/google/google-api-python-client <https://github.com/google/google-api-python-client#about>

> This client library is supported but in maintenance mode only. We are fixing necessary bugs and adding essential features to ensure this library continues to meet your needs for accessing Google APIs. Non-critical issues will be closed. Any issue may be reopened if it is causing ongoing problems.

Looks like we might want to migrate at some point. It'll be a big change.

On Wed, Sep 27, 2017 at 10:41 AM, Chris Riccomini <[email protected]> wrote:

> AFAIK, google-api-python-client is not in maintenance mode. In fact, I
> believe the idiomatic Python library (google-cloud-python) is built off of
> google-api-python-client. I have spoken with several Google cloud PMs who
> have pointed me at google-api-python-client as the canonical library to
> use, and the one that receives updates for new products first (before
> google-cloud-python).
>
> On Wed, Sep 27, 2017 at 10:34 AM, Tobias Feldhaus <[email protected]> wrote:
>
>> Sounds like a possible solution; however, to avoid hitting this problem
>> I've deleted all the tables before rerunning stuff. I think it might have
>> to do with the library. Airflow uses google-api-python-client, which is in
>> maintenance mode, and Google suggests switching to google-cloud-python. I
>> will write a PythonOperator DAG tomorrow and will check DAG against DAG
>> then to see if the library could be the problem.
>>
>> On 27.09.2017, 19:15, "Chris Riccomini" <[email protected]> wrote:
>>
>> Is it possible that you were getting a cache hit with the BQ operator?
>>
>> https://cloud.google.com/bigquery/docs/cached-results#bigquery-query-cache-api
>>
>> The operator does not currently expose this flag, and I couldn't find
>> whether the cache defaults to on or off for the insert-job API.
>>
>> On Wed, Sep 27, 2017 at 9:41 AM, Tobias Feldhaus <[email protected]> wrote:
>>
>> > I've created a table with only the missing value in the exact same
>> > partition, and then it goes through. Could it be that the volume of
>> > the data plays a role, or maybe the client libraries?
>> >
>> > On 27.09.2017, 17:46, "Tobias Feldhaus" <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I am tracing a bug in one of our data pipelines and I narrowed it down
>> > to a small number of events not being in a table (using Airflow 1.8.2).
>> > After interactively running the query that Airflow executed myself, I
>> > saw the missing entry. When Airflow executed the same query and wrote
>> > the results to a partitioned table in BQ, the entry was missing in that
>> > destination table.
>> > I've tried different scenarios several times now, and the only
>> > explanation I can come up with is that it _might_ be that using
>> > partitioned tables is not fully supported in Airflow, or that there is
>> > some weird bug in the bigquery-python implementation.
>> >
>> > When deleting the table, recreating it and reloading the complete
>> > date with Airflow, the data is still missing. When reloading a single
>> > day, it is also missing. I've created a Python script to execute the
>> > exact same query and it works as expected.
>> >
>> > Any advice on how to track this down further? Is this a known issue?
>> >
>> > Best,
>> > Tobias
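
For the gap described at the top of this message (daily runs missing between 2017-08-16 and 2017-08-19), here is a minimal sketch of how the missing days could be listed from a set of observed execution dates. It uses only the Python standard library and does not talk to Airflow. The function name `missing_daily_runs` and the hard-coded `observed_runs` list are hypothetical and for illustration only. In practice the observed dates would have to come from wherever the DAG runs are recorded.

```python
from datetime import date, timedelta


def missing_daily_runs(start, end, observed):
    """Return every day in [start, end] that does not appear in `observed`."""
    observed_days = set(observed)
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in expected if day not in observed_days]


# Illustrative input: only the two runs visible in the screenshots above.
observed_runs = [date(2017, 8, 16), date(2017, 8, 19)]

gaps = missing_daily_runs(date(2017, 8, 16), date(2017, 8, 19), observed_runs)
print([day.isoformat() for day in gaps])  # ['2017-08-17', '2017-08-18']
```

Running a check like this over the whole reprocessed month would at least show whether the missing rows line up with days that have no run at all, as opposed to days where the query ran but wrote incomplete data.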
