Re: [GRASS-stats] Loading a point-vector table with 466 columns

Roger Bivand Wed, 27 May 2009 01:14:29 -0700

On Tue, 26 May 2009, Nikos Alexandris wrote:

(Cc to Even Roualt ; Apologies to Even since he is not subscribed in the
list)


Roger:

Three minutes instead of thirty+ suggests that the OGR
plugin has trouble with SQLite as the DB format. So maybe
the default for plugin= should be FALSE, not NULL and automatic
use if present?


--%<--

Could you, Nikos,
make a script generating a similar table in spearfish, and two small
scripts exercising the problem (export to R with the plugin, and with
the temporary shapefile.


* The "problem" exists also with the default DBF as a back-end. I
created 1000 random points, filled less than half of the records with
random numbers and readVECT6("x", plugin=TRUE) takes again too much. I
broke the process since it was running for more than 20 mins.

OK. With 250 rows and 250 columns, I see an order of magnitude saving withplugin=FALSE. In plugin=FALSE, the times are split equally between writingthe temporary file from GRASS with v.out.ogr, and reading it into R withreadOGR(), as one might expect (that is all readVECT6(..., plugin=FALSE)is doing). Even on a small vector (bugsites, 90 points, 2 attributecolumns), plugin=FALSE is faster than plugin=TRUE by about 0.75 : 1.35,not quite twice. Which way does the problem scale, in numbers of features,numbers of attribute columns, or both?

Next script in R generating increasing NR and NC cases throughwriteVECT6() to test plugin=FALSE/plugin=TRUE ratios?


Roger


* A script is pasted on the bottom which has a small "bug" (details
below) :-)


First some results for 1000 rows by 500 columns:

system.time(random_points <- readVECT6("random_points_1000",

plugin=TRUE))
OGR data source with driver: GRASS
Source: "/geo/grassdb/spearfish60/user1/vector/random_points_1000/head",
layer: "1"
with  1000  rows and  501  columns
^C
### This was running for more than 10 hours !!! ###

system.time(random_points <- readVECT6("random_points_1000",

plugin=FALSE))
Exporting 1000 points/lines...
100%
1000 features written
OGR data source with driver: ESRI Shapefile
Source: "/geo/grassdb/spearfish60/user1/.tmp/vertical", layer:
"random_p"
with  1000  rows and  501  columns
Feature type: wkbPoint with 2 dimensions
  user  system elapsed
62.515   9.256  74.013

system.time(random_points <- read.csv("random_points_1000_table.csv"))

  user  system elapsed
 0.192   0.000   0.192



* A script to generate "some" random points, add columns and some R-code
to load with readVECT6( plugin = TRUE ), readVECT6( plugin = FALSE ) and
read.csv.

* The "bug" is that while the variable NUMBER="$[ ( $RANDOM % 100 ) +
1 ]" runs ok under the CLI, it doesn't work from within the bash
script!? So I've commented the respective line and use a fixed number
instead.

--%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-
#!/bin/bash

# example that  readVECT6 ( x , plugin = TRUE )  is too slow
# (also) using the default DBF driver
# first enter in spearfish60/user1

# try with a different back-end?
# db.connect driver=sqlite database=

# set numbers here:
RANDOM_POINTS=100 ; RANDOM_POINTS_CATS=100 ; NUMBER=111

# create RANDOM_POINTS random points
v.random --o output=random_points_`echo ${RANDOM_POINTS}` n=`echo
${RANDOM_POINTS}`

# add in database
v.db.addtable random_points_`echo ${RANDOM_POINTS}`


# add   $"{RANDOM_POINTS}"   columns
echo "\n* Adding ${RANDOM_POINTS} columns"
for x in `seq 1 ${RANDOM_POINTS}` ; do
v.db.addcol random_points_`echo ${RANDOM_POINTS}` column="col_"${x}"
integer"
done ; echo "\n* ${RANDOM_POINTS} columns added"


# check if columns are added
v.info -c random_points_${RANDOM_POINTS}



## WARNING: double loop below takes too long!
# --%<--
# It is simpler and faster to use a single loop with a fixed value
instead, e.g.:
 #for COL in `seq 1 5 ${RANDOM_POINTS}` ; do
 # v.db.update random_points_${RANDOM_POINTS} column="col_"${COL}""
value=222
 #done
# --%<--


# fill some columns/cats with random numbers between 1 and 100
# alter sequence as desired ; more numbers = more time to load in R
for COL in `seq 1 10 ${RANDOM_POINTS}` ; do
for CAT in `seq 1 10 ${RANDOM_POINTS_CATS}` ; do
 # this is ok in the command line but NOT when running the script?
 #NUMBER="$[ ( $RANDOM % 100 ) + 1 ]"
 v.db.update random_points_${RANDOM_POINTS} column="col_"${COL}""
value=${NUMBER} where="cat="${CAT}""
done
done


# [optional] fill in some "-999" values to use as NAs in R?
#NAN=-999
#for COL in `seq 1 5 $"{RANDOM_POINTS}"` ; do
# for CAT in `seq 1 5 $"{RANDOM_POINTS_CATS}"` ; do
#  v.db.update random_points_$"{RANDOM_POINTS}" column="col_"${COL}""
value=$"{NAN}" where="cat="${CAT}""
# done
#done

# check with v.db.select
# v.db.select random_points_${RANDOM_POINTS} | head -25

# export table as .csv file
db.out.ogr in=random_points_${RANDOM_POINTS} format=CSV
dsn=/geo/grassdb/spearfish60/random_points_csv_files
db_table=random_points_${RANDOM_POINTS}.csv

### end of bash script ###


## launch R
R
### R code

# load in R with:
library(spgrass6) ; G <- gmeta6()

#a. readVECT6()
system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
= FALSE ) )

#b. plugin=TRUE
system.time ( random_points <- readVECT6 ( "random_points_100" , plugin
= TRUE ) )

#c. as a csv table
# adjust as required
setwd("/geo/grassdb/spearfish60/random_points_csv_files")
table_to_read <- dir ( pattern = "^random.*.csv$" )
system.time ( random_points <- read.csv ( table_to_read ) )
str(random_points)
--%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<---%<-


--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: [email protected]

_______________________________________________
grass-stats mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/grass-stats

Re: [GRASS-stats] Loading a point-vector table with 466 columns

Reply via email to