There's been some discussion on the pymssql project about abandoning it in favor of pyodbc, and that's still ongoing. MS has decided to support pyodbc, so that's what we chose in the end.
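For anyone wondering what that looks like in practice on the Airflow side, it's roughly the pattern below (a simplified sketch: the driver name, server, credentials and table are placeholders, not our actual setup):

    import pyodbc

    # Driver name, server and credentials are placeholders; use whatever
    # ODBC driver version is actually installed on your workers.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlserver.example.com;"
        "DATABASE=reporting;"
        "UID=etl_user;PWD=secret"
    )

    rows = [(1, "foo"), (2, "bar")]
    cursor = conn.cursor()
    cursor.fast_executemany = True  # can speed up bulk inserts considerably
    cursor.executemany(
        "INSERT INTO dbo.my_table (id, name) VALUES (?, ?)", rows
    )
    conn.commit()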
We did build a wrapper around bcp, which is the fastest way of transferring the data, but it took a month to stabilize. It's very unfriendly in how it reports errors, so we had to invest some time in parsing the error and data files so you could at least see what was going on.

There are some nasty flaws with bcp:

* You end up iterating over all rows AND columns to bash the data into the right shape, mostly because we read from a different database platform into MSSQL, so CPU usage goes up.
* It creates a local file, which means all the data is written to and read from disk once, generating disk I/O spikes. It's basically one unnecessary copy in the process, so you need the throughput and IOPS for it.
* bcp doesn't have proper encoding support, so it's all ASCII.
* Error reporting is poor.

(A rough sketch of what the wrapping boils down to is at the bottom of this mail.)

I added a link to some command file examples in the main README; I think they demonstrate how it works.

@Tao: Yes, I had a look at Gobblin, but that's almost a complete ETL platform on its own. I wanted something concise and simple.

On Fri, May 31, 2019 at 10:52 PM Felipe Elias Lolas Isuani <[email protected]> wrote:

> Hi Gerard!
>
> Why not use the bcp in/out command directly with Airflow? We currently wrap
> the bcp in/out command into a Hook/Operator for working with SQL Server.
> Obviously you need to install the tools on the airflow workers.
>
> Cheers,
> Philip
>
> On 31-05-2019, at 09:57, Gerard Toonstra <[email protected]> wrote:
>
> > Hi all,
> >
> > We're running Airflow and write data to SQL Server. Python and SQL Server
> > aren't a marriage made in heaven, and we've literally spent hours fixing
> > driver issues, installations and working around things that should be easy,
> > time that's better spent writing app code.
> >
> > I looked at Sqoop, but that runs on Hadoop only. I didn't find anything
> > that was open source / standalone to use as a "generic data pump", so I
> > started a new project for that (let me know if something does exist). The
> > overall idea is that an Airflow operator generates a command file for this
> > process and then spawns a Java process to run the pipeline.
> >
> > It's available here:
> >
> > https://github.com/gtoonstra/genevere
> >
> > If you have similar issues with your environment, maybe this helps you out.
> > And if you want to contribute, just reach out!
> >
> > Rgds,
> >
> > Gerard
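P.S. Here's the rough sketch of the bcp wrapping I referred to above. It's heavily simplified and not our actual code: the delimiter, file paths and function name are made up, and it glosses over the per-column massaging and the encoding issues, but it shows the extra on-disk copy and the error-file parsing:

    import subprocess
    import tempfile

    def bcp_load(rows, table, server, database, user, password):
        # bcp only reads from a file on disk, so every row gets written out
        # and read back once -- the extra copy (and I/O spike) from above.
        with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False,
                                         encoding="ascii") as f:
            for row in rows:
                # bash every value into a plain ASCII string bcp will accept
                f.write("|".join("" if c is None else str(c) for c in row))
                f.write("\n")
            datafile = f.name

        errfile = datafile + ".err"
        cmd = [
            "bcp", table, "in", datafile,
            "-S", server, "-d", database, "-U", user, "-P", password,
            "-c",           # character mode
            "-t", "|",      # field terminator, must match the join above
            "-e", errfile,  # bcp writes rejected rows here
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # bcp's console output is terse, so surface the error file too
            with open(errfile) as ef:
                raise RuntimeError(result.stdout + result.stderr + ef.read())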
