Thanks! It turns out I was using less space on the VM than I thought; that, combined with a lack of decent error checking, meant I didn't catch the out-of-space problem. As soon as I added more space, I was able to index everything without a problem.
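For anyone who hits the same thing later: a minimal pre-flight check along these lines would have caught it early. This is only a sketch; the path is the package-default data directory from /etc/default/elasticsearch, so adjust it to wherever your path.data actually points, and pick your own threshold.

import java.io.File;

public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Package-default data directory; adjust if path.data points elsewhere.
        File dataDir = new File("/var/lib/elasticsearch");

        long freeMb = dataDir.getUsableSpace() / (1024 * 1024);
        System.out.println("Usable space on " + dataDir + ": " + freeMb + " MB");

        // Refuse to start a big bulk load on a nearly full volume.
        // The 5 GB threshold is an arbitrary example.
        if (freeMb < 5 * 1024) {
            throw new IllegalStateException(freeMb + " MB free is too little to index safely");
        }
    }
}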
Thanks again.

On Tuesday, September 9, 2014 6:49:35 PM UTC-4, Jörg Prante wrote:

Code looks okay, so it might just be the full volume that is in the way.

Jörg

On Tue, Sep 9, 2014 at 8:44 PM, Joshua P wrote:

This is the code I've been using to index. I'm going to try to fix the running-out-of-space issue and then try slimming down the settings. Thank you.

import java.io.IOException;
import java.util.List;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

import com.fasterxml.jackson.databind.ObjectMapper; // Jackson 2 assumed

public class Indexer {

    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {

        DBConnection dbConn = new DBConnection("");

        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo"
                + " WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();

        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data").build();

        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- " + failure.getStatus().name()
                                    + " -- " + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                }

                System.out.println("Successfully indexed " + request.numberOfActions()
                        + " records in " + response.getTook() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (int i = 0; i < pgiTable.size(); i++) {
            // prep location field
            PropertyGeneralInfoRow pgiRow = pgiTable.get(i);

            Double[] location = { pgiRow.getLon_dbl(), pgiRow.getLat_dbl() };

            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
                    + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
                    + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);

            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty")
                        .source(jsonRow.getBytes()));
                // bulkProcessor.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- " + pgiRow.getProperty_id().toString());
            }
        }

        bulkProcessor.flush();
        bulkProcessor.close();
    }
}

On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:

Check the path.data setting in config/elasticsearch.yml

Jörg
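On the Debian/Ubuntu package the same thing can also be set as DATA_DIR in /etc/default/elasticsearch. In config/elasticsearch.yml it would look something like the following; the path here is only an example, so point it at whatever volume has room:

# config/elasticsearch.yml
# Hypothetical example: relocate the data directory to a larger volume.
path.data: /mnt/bigdisk/elasticsearch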
On Tue, Sep 9, 2014 at 7:50 PM, Joshua P wrote:

Just reran the indexer and found this error coming up. I'm running out of disk space on the partition ES wants to write to.

F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR -- TranslogException[[index_type][0] Failed to write operation [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested: IOException[No space left on device]; -- index_type

Where would I change the write location? Which config file?

On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:

Hi Jörg,

Can you elaborate on what you mean by my still needing more fine tuning?

I've upped the heap size to 4g (in both places I mentioned before, because it's not clear to me which one ES actually uses). I haven't tried to index again yet. Other than throttling my indexing, what are some other things I need to be thinking about?

On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:

Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and indexing around 1 million docs, you need some more fine tuning, which is complicated. On your machine it is fine to set the heap to 4 GB, which is 50% of the 8 GB of RAM.

Jörg

On Tue, Sep 9, 2014 at 5:39 PM, Joshua P wrote:

Here is /etc/default/elasticsearch:

# Run Elasticsearch as this user ID and group ID
#ES_USER=elasticsearch
#ES_GROUP=elasticsearch

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=512m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Maximum number of open files, defaults to 65535.
MAX_OPEN_FILES=65535

# Maximum locked memory size. Set to "unlimited" if you use the
# bootstrap.mlockall option in elasticsearch.yml. You must also set
# ES_HEAP_SIZE.
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
#MAX_MAP_COUNT=262144

# Elasticsearch log directory
#LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
#DATA_DIR=/var/lib/elasticsearch

# Elasticsearch work directory
#WORK_DIR=/tmp/elasticsearch

# Elasticsearch configuration directory
#CONF_DIR=/etc/elasticsearch

# Elasticsearch configuration file (elasticsearch.yml)
#CONF_FILE=/etc/elasticsearch/elasticsearch.yml

# Additional Java OPTS
#ES_JAVA_OPTS=

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

I also see the same setting in /etc/init.d/elasticsearch. Do you know which file takes priority? And what would a good size be?

On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I am not sure which variable you are referring to in the memory settings in the config file; please paste the comment and the config. I usually change the config from the init.d script.

The best approach would be to bulk index, say, 10,000 feeds in sync mode, wait until everything is indexed, and then proceed to the next batch. I am not sure about the Java API, but a while back I used to curl the stats API to see how many requests were rejected.

Thanks
Vineeth
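That stats API is also exposed through the Java admin client, so the rejected counts can be polled from the indexing job itself. A sketch against the 1.x Java API (method names as I recall them for 1.3; only the index and bulk pools are printed):

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

public class ThreadPoolMonitor {

    // Print queue depth and rejection counts for the index/bulk thread pools.
    // A growing "rejected" count means docs arrive faster than the node can queue them.
    public static void printThreadPoolStats(Client client) {
        NodesStatsResponse stats = client.admin().cluster().prepareNodesStats()
                .clear()
                .setThreadPool(true)
                .execute().actionGet();

        for (NodeStats node : stats.getNodes()) {
            for (ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("index".equals(pool.getName()) || "bulk".equals(pool.getName())) {
                    System.out.println(node.getNode().getName() + " / " + pool.getName()
                            + ": queue=" + pool.getQueue()
                            + ", rejected=" + pool.getRejected());
                }
            }
        }
    }
}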
On Tue, Sep 9, 2014 at 8:58 PM, Joshua P wrote:

You also said you wouldn't recommend indexing that much information at once. How would you suggest breaking it up, and what status should I look for before starting another batch? I have to come up with a process that is repeatable and mostly automated.

On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:

Thanks for the reply, Vineeth!

What's a practical heap size? I've seen some people say they set it to 30gb, but this confuses me because the comment in the /etc/default/elasticsearch file suggests the max is only 1gb.

I'll look into the threadpool issue. Is there a Java API for monitoring cluster node health? Can you point me at an example or give me a link for that?

Thanks!

On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:

Hello Joshua,

I have a feeling this has something to do with the threadpool. There is a limit on the number of feeds that can be queued for indexing.

Try increasing the size of the threadpool queues for index and bulk to a large number. Also, through the cluster nodes API for the threadpool, you can see whether any requests have failed. Monitor this API for requests that fail due to the large volume.

Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html

Having said that, I won't recommend bulk indexing that much information at a time, and 512 MB of heap is not going to help much.

Thanks
Vineeth

On Tue, Sep 9, 2014 at 7:48 PM, Joshua P wrote:

Hi there!

I'm trying to do a one-time index of about 800,000 records into an Elasticsearch instance, but I'm having a bit of trouble: it continually fails around 200,000 records. Looking at it in the Elasticsearch Head plugin, my index goes offline and becomes unrecoverable.

For now, I have it running on a VM on my personal machine.

VM config:
Ubuntu Server 14.04 64-bit
8 GB RAM
2 processors
32 GB SSD

Java:
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Elasticsearch is using mostly the defaults. This is the output of curl http://localhost:9200/_nodes/process?pretty:

{
  "cluster_name" : "property_transaction_data",
  "nodes" : {
    "KlFkO_qgSOKmV_jjj5xeVw" : {
      "name" : "Marvin Flumm",
      "transport_address" : "inet[/192.168.133.131:9300]",
      "host" : "ubuntu-es",
      "ip" : "127.0.1.1",
      "version" : "1.3.2",
      "build" : "dee175d",
      "http_address" : "inet[/192.168.133.131:9200]",
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 1092,
        "max_file_descriptors" : 65535,
        "mlockall" : true
      }
    }
  }
}

I adjusted ES_HEAP_SIZE to 512mb.

I'm using the following code to pull data from SQL Server and index it.
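Following up on the batching advice above, a rough sketch of a sync-mode version of that indexing loop with the 1.x Java API. The index and type names are the ones from the indexer code; the batch size is a placeholder (Vineeth suggested around 10,000), and a real version would back off and retry rather than just bail out:

import java.util.List;

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

public class SyncBatchIndexer {

    // Index JSON rows in synchronous batches: send one bulk request, block
    // until it is acknowledged, check for failures, then send the next batch.
    public static void indexInBatches(Client client, List<String> jsonRows, int batchSize) {
        for (int from = 0; from < jsonRows.size(); from += batchSize) {
            int to = Math.min(from + batchSize, jsonRows.size());

            BulkRequestBuilder bulk = client.prepareBulk();
            for (String jsonRow : jsonRows.subList(from, to)) {
                bulk.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(jsonRow));
            }

            // actionGet() blocks until the whole batch has been processed,
            // so the node is never flooded with concurrent bulk requests.
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // Stop (or back off and retry) instead of piling on more load.
                throw new RuntimeException("Bulk failures at offset " + from + ": "
                        + response.buildFailureMessage());
            }
            System.out.println("Indexed " + (to - from) + " docs in " + response.getTook());
        }
    }
}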
