I was running out of hard disk space on my root partition. For now, the geocoder database lives on a separate partition and hard drive. I was debugging and capturing the output of every query and command, so it generated a huge file in the gigabyte range, especially when each query was aborted because it was not UTF-8. The iconv command appears to be working, albeit converting twice. I will be posting a bug report with an attached patch shortly. I think I deleted the Windows section, so if you apply the patch you will have to add that back. The iconv is not in the patch. It will probably be a while before most distributions carry the new version of shp2pgsql; I thought I had the latest, but it seems it is only available from Subversion at this time. It is really nice having my own geocoder and TIGER 2010 database. Thanks for everyone's help. I can also use the data in my favorite GIS program, QGIS, to draw maps, and even run my own MapServer.

On 04/13/2011 07:15 PM, Sylvain Racine wrote:
Hello,

I saw a little error in your script. If you pipe your output to iconv, you have to remove -W "latin1" from the shp2pgsql command. You should have:

$PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf tiger_staging.mn_county10 | iconv -f latin1 -t UTF8 | $PGBIN/psql -d $PGDATABASE

If you don't remove the -W flag, your data will be converted twice!
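You can see the double-conversion damage directly with iconv alone (a sketch; 0xE9 is "é" in Latin1):

```shell
# Converted once, the Latin1 byte 0xE9 becomes correct UTF-8.
# Converted a second time, the two UTF-8 bytes are misread as two
# separate Latin1 characters and the text is mangled.
printf '\351' | iconv -f latin1 -t UTF8                             # é
printf '\351' | iconv -f latin1 -t UTF8 | iconv -f latin1 -t UTF8   # Ã©
```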

About your running out of disk space: are you sure you are not short of RAM instead? I am surprised you ran out of disk space with this command. I think the output from shp2pgsql is held in RAM and then sent to iconv in one shot. If your RAM is low, it is possible that the extra data is put in a temporary file on your disk drive (in the swap partition, I think), but that temp file would indicate that you first ran out of RAM. I am not completely sure of my theory, but in my mind that is how Linux works.

If you want to know the amount of memory needed by this script, run it in stages:

$PGBIN/shp2pgsql -c -s 4269 -g the_geom tl_2010_27_county10.dbf tiger_staging.mn_county10 > temp.sql
iconv -f latin1 -t UTF8 < temp.sql > anothertemp.sql
$PGBIN/psql -d $PGDATABASE < anothertemp.sql
The size of the larger of temp.sql and anothertemp.sql is the minimum amount of RAM you need to run the piped command.

Note: the above is slower than the pipeline because you write your data to the disk drive between each operation. SATA or IDE disk access is always slower than RAM access, unless you use an SSD.
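If you do write temp.sql to disk, iconv can also tell you whether it actually contains non-UTF-8 bytes before you bother converting it (a sketch; assumes GNU iconv and the temp.sql produced by the commands above):

```shell
# iconv exits non-zero at the first byte that is not valid UTF-8,
# so a UTF8 -> UTF8 pass works as a pure validity check.
if iconv -f UTF8 -t UTF8 temp.sql > /dev/null 2>&1; then
    echo "temp.sql is already valid UTF-8"
else
    echo "temp.sql has non-UTF-8 bytes; run it through iconv -f latin1 -t UTF8"
fi
```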

Hope it will be useful.

Sylvain Racine

On 13/04/2011 05:40, Don wrote:
Here is the version:
RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id: shp2pgsql-core.h 6358 2010-12-13 20:09:26Z pramsey $)
USAGE: shp2pgsql [<options>] <shapefile> [<schema>.]<table>

I had tried the iconv fix, which is a great idea. It seemed to work, but I ran out of disk space:

$PGBIN/shp2pgsql -c -s 4269 -g the_geom -W "latin1" tl_2010_27_county10.dbf tiger_staging.mn_county10 | iconv -f latin1 -t UTF8 | $PGBIN/psql -d $PGDATABASE

So I have been spending time on creating more disk space instead of fixing this. I would like to share the patches that I have so that others can try to improve it, especially those who are more familiar with the TIGER 2010 file formats. It would be nice to have a working TIGER 2010 geocoder for Linux in the next PostGIS release.

I don't remember offhand which state had the UTF problem; several of them did. I would run ./tiger_load >& del with a "set -x" in the file for debugging. (You can then search that file for "aborted" and backtrack to find out which file it was processing.) It got rather large, and so did my log files. I was trying to get more info on this when I ran out of disk space; I could not even vacuum any databases. I see many large files in my postgres directories and wonder what they are all for. I have been using two different database clusters, one on a new larger drive for the geocoder. It seems that the cluster on the small drive has a lot of files that are taking up a lot of space, though.
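One way to backtrack through that capture file (a sketch; assumes the "del" file from the ./tiger_load >& del run above, with "set -x" tracing enabled):

```shell
# Each "set -x" trace line names the command being run, so printing
# a few lines of context before every abort shows which shapefile
# was being processed when the transaction died.
grep -n -B 5 'aborted' del | head -40
```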

On 04/13/2011 01:32 AM, Paragon Corporation wrote:
Don,
Which state were you processing? I can check it out and see if I get similar errors on my shp2pgsql. You could be right and the file just isn't Latin1.
The regress test did seem to pass for me once that ticket was fixed.
Also, to confirm you are running the latest version of shp2pgsql: if you run shp2pgsql from the command line, it should output the version. Mine, for example, reads
RELEASE: 2.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1 (r$Id: shp2pgsql-core.h 6925 2011-03-18 16:24:33Z pramsey $)
The version unfortunately isn't quite accurate, since it's evidently looking at the .h file instead of the .c file. So though my version says 6925, it's really 6932 or later.
http://trac.osgeo.org/postgis/changeset/6932
Hope that helps,
Regina
http://www.postgis.us
------------------------------------------------------------------------
*From:* [email protected] [mailto:[email protected]] *On Behalf Of *Don
*Sent:* Tuesday, April 12, 2011 3:08 AM
*To:* PostGIS Users Discussion
*Subject:* Re: [postgis-users] "Linux" geocoder script ?

My database is encoded as
 geocoder  | drh      | UTF8     | C         | en_US.UTF-8 | .
All my shp2pgsql statements have the -W option like this.
${loader} -a -s 4269 -g the_geom -W "latin1" $z ${staging_schema}.${state_abbrev}_${table_name} | $PGBIN/psql -d $PGDATABASE;

Here is the bug that I was referring to.
http://trac.osgeo.org/postgis/ticket/808
In one case I had a very large number of inserts processed for the shape file and then got that error.

From your link it says:
"To enable automatic character set conversion, you have to tell PostgreSQL the character set (encoding) you would like to use in the client. There are several ways to accomplish this: "
Perhaps I need to use SET CLIENT_ENCODING TO 'value'; in psql, or is shp2pgsql supposed to do that when I use the -W option? PostGIS is expecting UTF-8 when it should be expecting Latin1 and converting it to UTF-8.
Could the data type of a column have some effect on this?
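As an alternative to putting SET CLIENT_ENCODING into the SQL, libpq reads the client encoding from the environment, so something like this should make psql treat its input as Latin1 (a sketch; the load_mn.sql file name is hypothetical):

```shell
# PGCLIENTENCODING sets client_encoding for any libpq client,
# so psql converts the incoming Latin1 to the server's UTF-8.
PGCLIENTENCODING=LATIN1 $PGBIN/psql -d $PGDATABASE -f load_mn.sql
```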


On 04/11/2011 08:52 PM, Sylvain Racine wrote:
Hello,

This is not a shp2pgsql bug. You get this error when you try to insert string data into PostgreSQL with a different encoding than that of your database. E.g., your data is in Latin1 (ISO-8859-1) and you insert it into a UTF-8 database. To fix the error message, you need to convert your data.

PostgreSQL has an internal converter, and shp2pgsql has one too. Try shp2pgsql -W <encoding>, where <encoding> is the encoding of your DBase (.dbf) file. This is called the "client encoding" in PostgreSQL. See the list of valid encodings:
http://www.postgresql.org/docs/9.0/static/multibyte.html

Don't confuse it with the database encoding, which is the one you use to create your database. There is also a default database charset, depending on your OS: it is the one used to create the template1 database in initdb. Mine is "UTF8" on Ubuntu.

Hope this information helps.

Regards

Sylvain Racine

On 2011-04-11 21:22, Don wrote:
I have got the TIGER 2010 geocoder to work on my openSUSE system.
geocoder=#
geocoder=# SELECT g.rating,
geocoder-#         ST_X(geomout) As lon,
geocoder-#         ST_Y(geomout) As lat, (addy).*
geocoder-# FROM geocode('1731 New Hampshire Avenue Northwest, Washington, DC 20010') As g;
 rating |        lon        |       lat        | address | predirabbrev |  streetname   | streettypeabbrev | postdirabbrev | internal |  location  | stateabbrev |  zip  | parsed
--------+-------------------+------------------+---------+--------------+---------------+------------------+---------------+----------+------------+-------------+-------+--------
      0 | -77.0399013800607 | 38.9134181361424 | 1731    |              | New Hampshire | Ave              | NW            |          | Washington | DC          | 20009 | t
(1 row)
There are a few glitches. I noticed that I am getting this message sometimes.
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
ERROR:  invalid byte sequence for encoding "UTF8": 0xed6f20
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ERROR:  current transaction is aborted, commands ignored until end of transaction block
I researched this some, and it appears to be a shp2pgsql bug. But I am using postgis-utils-2.0.0SVN-1.2.x86_64 and postgis-2.0.0SVN-1.2.x86_64, where this has supposedly been fixed. Or could the census data be corrupted? Either way, I have "lost" some of the data due to this error.
I had problems with psql generating Ctrl-M (\r) instead of \n, which would really mess up the script when it ran.
So after I generated my load-tiger script, I ran this command:
tr "\r" "\n" < load_tiger > load_tiger2
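A quick sanity check after that pass (a sketch; file name as above):

```shell
# Confirm no carriage returns survive in the rewritten script.
if grep -q "$(printf '\r')" load_tiger2; then
    echo "carriage returns still present"
else
    echo "load_tiger2 is clean"
fi
```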

_______________________________________________
postgis-users mailing list
[email protected]
http://postgis.refractions.net/mailman/listinfo/postgis-users


