On Mon, 25 Jun 2018 at 11:38, Anto Aravinth <anto.aravinth....@gmail.com> wrote:
> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophil...@gmail.com> wrote:
>>
>> Anto Aravinth <anto.aravinth....@gmail.com> writes:
>>
>> > Thanks for the response. I'm not sure how long this tool takes for
>> > the 70GB data.
>> >
>> > I used node to stream the xml files into inserts.. which was very slow..
>> > Actually the xml contains 40 million records, out of which 10 million took
>> > around 2 hrs using nodejs. Hence, I thought I would use the COPY command, as
>> > suggested on the internet.
>> >
>> > Definitely, will try the code and let you know.. But it looks like it uses the
>> > same INSERT, not COPY.. interesting if it runs quick on my machine.
>> >
>> > On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat <adrien.nay...@anayrat.info>
>> > wrote:
>> >
>> >> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>> >> > Hello Everyone,
>> >> >
>> >> > I have downloaded the Stackoverflow posts xml (contains all SO questions till
>> >> > date).. the file is around 70GB.. I wanna import the data in those xml to my
>> >> > table.. is there a way to do so in postgres?
>> >> >
>> >> > Thanks,
>> >> > Anto.
>> >>
>> >> Hello Anto,
>> >>
>> >> I used this tool:
>> >> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>
>> If you are using nodejs, then you can easily use the pg-copy-streams
>> module to insert the records into your database. I've been using this
>> for inserting large numbers of records from NetCDF files. It takes between
>> 40 and 50 minutes to insert 60 million+ records, and we are doing
>> additional calculations on the values, not just inserting them; plus we
>> are inserting over the network into a database which is also performing
>> other processing.
>>
>> We found a significant speed improvement with COPY over blocks of insert
>> transactions, which in turn were faster than individual inserts. The only
>> downside with using COPY is that it either completely works or completely
>> fails, and when it fails, it can be tricky to work out which record is
>> causing the failure. A benefit of using blocks of transactions is that
>> you have more fine-grained control, allowing you to recover from some
>> errors or provide more specific detail about the cause of the error.
>
> Sure, let me try that.. I have a question here: COPY usually works when
> you move data from files to your postgres instance, right? Now in node.js,
> processing the whole file, can I use COPY programmatically, like
> COPY Stackoverflow <calculated value at run time>? Because from the doc:
>
> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>
> I don't see it's possible. Maybe I need to convert the files into something
> COPY understands first?
>
> Anto.
>

Yes. Essentially what you do is create a stream and feed whatever
information you want to copy into that stream. PG sees the data as if it
was reading each line from a file, so you push data onto the stream where
each item is separated by a tab (or whatever delimiter you choose). Here
is the basic low level function I use (don't know how the formatting will
go!)
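(For context, a rough sketch of the module-level setup the function assumes:
the real dependencies are the pg, pg-copy-streams and verror packages, while
getClient, releaseClient and moduleName are only illustrative placeholders
for your own pool helpers and logging prefix.)

// Sketch of assumed setup -- getClient/releaseClient and moduleName are
// placeholders for your own pool helpers and log prefix.
const assert = require("assert");
const { Readable } = require("stream");
const { Pool } = require("pg");
const copyFrom = require("pg-copy-streams").from;
const VError = require("verror");

const moduleName = "copyLoader";   // only used in error messages
const pool = new Pool();           // connection settings come from PG* env vars

function getClient() {
  return pool.connect();           // resolves to a pooled client
}

function releaseClient(client) {
  client.release();                // return the client to the pool
}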
async function copyInsert(sql, stringifyFN, records) {
  const logName = `${moduleName}.copyInsert`;
  var client;

  assert.ok(Array.isArray(records), "The records arg must be an array");
  assert.ok(typeof(stringifyFN) === "function",
            "The stringifyFN arg must be a function");
  return getClient()
    .then(c => {
      client = c;
      return new Promise(function(resolve, reject) {
        var stream, rs;
        var idx = 0;

        function done() {
          releaseClient(client);
          client = undefined;
          resolve(idx + 1);
        }

        function onError(err) {
          if (client !== undefined) {
            releaseClient(client);
          }
          reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
        }

        // Readable._read implementation: push one stringified record per
        // call, then signal end-of-stream with null once we run out.
        function arrayRead() {
          if (idx === records.length) {
            rs.push(null);
          } else {
            let rec = records[idx];
            rs.push(stringifyFN(rec));
            idx += 1;
          }
        }

        rs = new Readable;
        rs._read = arrayRead;
        rs.on("error", onError);
        // client.query(copyFrom(...)) returns a writable COPY FROM STDIN stream.
        stream = client.query(copyFrom(sql));
        stream.on("error", onError);
        stream.on("end", done);
        rs.pipe(stream);
      });
    })
    .catch(err => {
      throw new VError(err, `${logName} Failed COPY insert`);
    });
}

and I will call it like

  copyInsert(sql, stringifyClimateRecord, records)

where the sql and stringifyClimateRecord arguments are

const sql = `COPY access_s.climate_data_ensemble_${ensemble} ` +
      "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds," +
      "vprp_09,vprp_15,wind_speed) FROM STDIN";

function stringifyClimateRecord(rec) {
  return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t` +
    `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
}

The stringifyClimateRecord function returns a record to be inserted as a
'line' into the stream, with values separated by tabs. Records is an array
of data records where each record is itself an array.

--
regards,

Tim

--
Tim Cross
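For the Stackoverflow posts specifically, a stringify function along the same
lines might look roughly like the sketch below; the so_posts table and its
columns are only an assumed example schema, and each record is assumed to be
an array of already-parsed field values coming out of the XML parser:

// Assumed target table and columns -- adjust to the real schema.
const postsSql =
  "COPY so_posts (id, post_type_id, creation_date, title, body) FROM STDIN";

// COPY text format requires backslash, tab, newline and carriage return
// to be escaped inside values.
function escapeCopyValue(value) {
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/\t/g, "\\t")
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

function stringifyPostRecord(rec) {
  // rec is assumed to be [id, postTypeId, creationDate, title, body]
  return rec.map(escapeCopyValue).join("\t") + "\n";
}

// Feed the parsed rows through in batches, e.g. 50,000 records at a time:
//   await copyInsert(postsSql, stringifyPostRecord, batchOfRecords);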