There's been some discussion on the pymssql project about abandoning it in favor of pyodbc, and that's still ongoing. MS has decided to support pyodbc, so that's what we chose in the end.
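For anyone wondering what that looks like in practice on the Airflow side, it's roughly the pattern below (a simplified sketch: the driver name, server, credentials and table are placeholders, not our actual setup):

    import pyodbc

    # Driver name, server and credentials are placeholders; use whatever
    # ODBC driver version is actually installed on your workers.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlserver.example.com;"
        "DATABASE=reporting;"
        "UID=etl_user;PWD=secret"
    )

    rows = [(1, "foo"), (2, "bar")]
    cursor = conn.cursor()
    cursor.fast_executemany = True  # can speed up bulk inserts considerably
    cursor.executemany(
        "INSERT INTO dbo.my_table (id, name) VALUES (?, ?)", rows
    )
    conn.commit()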
We did build a wrapper around bcp, which is the fastest way of transferring the data, but it took a month to stabilize. It's very unfriendly in how it reports errors, so we had to invest some time in parsing the error and data files so you could at least see what was going on.

There are some nasty flaws with bcp:

* You end up iterating over all rows AND columns to bash the data into the right shape, mostly because we read from a different database platform into MSSQL, so CPU usage goes up.
* It creates a local file, which means all the data is written to and read from disk once, generating disk I/O spikes. It's basically one unnecessary copy in the process, so you need the throughput and IOPS for it.
* bcp doesn't have proper encoding support, so it's all ASCII.
* Error reporting is poor.

(A rough sketch of what the wrapping boils down to is at the bottom of this mail.)

I added a link to some command file examples in the main README; I think they demonstrate how it works.

@Tao: Yes, I had a look at Gobblin, but that's almost a complete ETL platform on its own. I wanted something concise and simple.

On Fri, May 31, 2019 at 10:52 PM Felipe Elias Lolas Isuani <[email protected]> wrote:

> Hi Gerard!
>
> Why not use the bcp in/out command directly with Airflow? We currently wrap
> the bcp in/out command into a Hook/Operator for working with SQL Server.
> Obviously you need to install the tools on the airflow workers.
>
> Cheers,
> Philip
>
> On 31-05-2019, at 09:57, Gerard Toonstra <[email protected]> wrote:
>
> > Hi all,
> >
> > We're running Airflow and write data to SQL Server. Python and SQL Server
> > aren't a marriage made in heaven, and we've literally spent hours fixing
> > driver issues, installations and working around things that should be easy,
> > time that's better spent writing app code.
> >
> > I looked at Sqoop, but that runs on Hadoop only. I didn't find anything
> > that was open source / standalone to use as a "generic data pump", so I
> > started a new project for that (let me know if something does exist). The
> > overall idea is that an Airflow operator generates a command file for this
> > process and then spawns a Java process to run the pipeline.
> >
> > It's available here:
> >
> > https://github.com/gtoonstra/genevere
> >
> > If you have similar issues with your environment, maybe this helps you out.
> > And if you want to contribute, just reach out!
> >
> > Rgds,
> >
> > Gerard
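P.S. Here's the rough sketch of the bcp wrapping I referred to above. It's heavily simplified and not our actual code: the delimiter, file paths and function name are made up, and it glosses over the per-column massaging and the encoding issues, but it shows the extra on-disk copy and the error-file parsing:

    import subprocess
    import tempfile

    def bcp_load(rows, table, server, database, user, password):
        # bcp only reads from a file on disk, so every row gets written out
        # and read back once -- the extra copy (and I/O spike) from above.
        with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False,
                                         encoding="ascii") as f:
            for row in rows:
                # bash every value into a plain ASCII string bcp will accept
                f.write("|".join("" if c is None else str(c) for c in row))
                f.write("\n")
            datafile = f.name

        errfile = datafile + ".err"
        cmd = [
            "bcp", table, "in", datafile,
            "-S", server, "-d", database, "-U", user, "-P", password,
            "-c",           # character mode
            "-t", "|",      # field terminator, must match the join above
            "-e", errfile,  # bcp writes rejected rows here
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # bcp's console output is terse, so surface the error file too
            with open(errfile) as ef:
                raise RuntimeError(result.stdout + result.stderr + ef.read())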
