I'm going to try to fix the running-out-of-space issue (my planned path.data
change is sketched just after the code below) and then try slimming down my
settings. Thank you.

This is the code I've been using to index:
// (imports restored for completeness; DBConnection, PropertyGeneralInfoRow,
// and geocode are my own classes, and the logger assumes log4j 1.x)
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

import com.fasterxml.jackson.databind.ObjectMapper;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class Indexer {

    private static final Logger logger = LogManager.getLogger("ESBulkUploader");

    public static void main(String[] args) throws IOException, NoSuchFieldException {
        DBConnection dbConn = new DBConnection("");
        String query = "SELECT TOP 300000 * FROM vw_PropertyGeneralInfo "
                + "WHERE Country_id = 1 ORDER BY Property_id DESC";

        System.out.println("getting data");
        List<PropertyGeneralInfoRow> pgiTable = dbConn.ExecuteQueryWithoutParameters(query);
        System.out.println("got data");

        ObjectMapper mapper = new ObjectMapper();
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "property_transaction_data")
                .build();
        Client client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("192.168.133.131", 9300));

        BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                System.out.println("About to index " + request.numberOfActions()
                        + " records of size " + request.estimatedSizeInBytes() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // log every per-item failure ES reports for the batch
                if (response.hasFailures()) {
                    for (BulkItemResponse item : response.getItems()) {
                        BulkItemResponse.Failure failure = item.getFailure();
                        if (failure != null) {
                            System.out.println(failure.getId() + " -- "
                                    + failure.getStatus().name() + " -- "
                                    + failure.getMessage() + " -- " + failure.getType());
                        }
                    }
                }
                System.out.println("Successfully indexed " + request.numberOfActions()
                        + " records in " + response.getTook() + ".");
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // the whole bulk request failed, not individual items
                System.out.println("failure somewhere on " + request.toString());
                failure.printStackTrace();
                logger.warn("failure on " + request.toString());
            }
        }).setBulkActions(500).setConcurrentRequests(1).build();

        for (int i = 0; i < pgiTable.size(); i++) {
            // prep location field ([lon, lat] order)
            PropertyGeneralInfoRow pgiRow = pgiTable.get(i);
            Double[] location = {pgiRow.getLon_dbl(), pgiRow.getLat_dbl()};
            geocode geocode = new geocode();
            geocode.setLocation(location);
            pgiRow.setGeocode(geocode);

            // prep full address string
            pgiRow.setFulladdressstring(pgiRow.getPropertykey_tx() + ", "
                    + pgiRow.getCity_tx() + ", " + pgiRow.getStateprov_cd() + ", "
                    + pgiRow.getCountry_tx() + ", " + pgiRow.getPostalcode_tx());

            String jsonRow = mapper.writeValueAsString(pgiRow);
            if (jsonRow != null && !jsonRow.isEmpty() && !jsonRow.equals("{}")) {
                // encode explicitly as UTF-8 rather than the platform default
                bulkProcessor.add(new IndexRequest("rcapropertydata", "rcaproperty")
                        .source(jsonRow.getBytes(StandardCharsets.UTF_8)));
                // bulkProcessor.add(client.prepareIndex("rcapropertydata",
                //         "rcaproperty").setSource(jsonRow));
            } else {
                // don't add null strings..
                try {
                    System.out.println(pgiRow.toString());
                } catch (Exception e) {
                    System.out.println("Some error in the toString() method...");
                }
                System.out.println("Some json output was null. -- "
                        + pgiRow.getProperty_id().toString());
            }
        }

        // push out any partial batch, then release the processor's resources
        bulkProcessor.flush();
        bulkProcessor.close();
    }
}
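
For the disk-space problem, this is the change I'm planning in
config/elasticsearch.yml, per Jörg's pointer below (the mount point is just an
example for my VM, not something from this thread):

    # write index data to the partition that actually has free space
    path.data: /mnt/esdata

If I'm reading the init script correctly, uncommenting and setting DATA_DIR in
/etc/default/elasticsearch should have the same effect, since the script passes
it through to Elasticsearch.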
On Tuesday, September 9, 2014 1:57:54 PM UTC-4, Jörg Prante wrote:
>
> Check the path.data setting in config/elasticsearch.yml
>
> Jörg
>
> On Tue, Sep 9, 2014 at 7:50 PM, Joshua P <[email protected]> wrote:
>
>> Just reran the indexer and found this error coming up. I'm running out of
>> disk space on the partition ES wants to write to.
>>
>> F38KqHhnRDWtiJCss5Wz0g -- INTERNAL_SERVER_ERROR --
>> TranslogException[[index_type][0] Failed to write operation
>> [org.elasticsearch.index.translog.Translog$Create@6f1f6b1e]]; nested:
>> IOException[No space left on device]; -- index_type
>>
>> Where would I change the write location? Which config file?
>>
>> On Tuesday, September 9, 2014 1:28:21 PM UTC-4, Joshua P wrote:
>>>
>>> Hi Jörg,
>>>
>>> Can you elaborate on what you mean when you say I still need more
>>> fine-tuning?
>>>
>>> I've upped the heap size to 4g (in both places I mentioned before,
>>> because it's not clear to me which one ES actually uses). I haven't tried
>>> to index again yet.
>>> Other than throttling my indexing, what are some other things I need to
>>> be thinking about?
>>>
>>> On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>>>>
>>>> Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and
>>>> indexing around 1 million docs, you need some more fine-tuning, which is
>>>> complicated. Your machine is fine for setting the heap to 4 GB, which is
>>>> 50% of the 8 GB RAM.
>>>>
>>>> Jörg
>>>>
>>>> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P <[email protected]> wrote:
>>>>
>>>>> Here is /etc/default/elasticsearch
>>>>>
>>>>> # Run Elasticsearch as this user ID and group ID
>>>>> #ES_USER=elasticsearch
>>>>> #ES_GROUP=elasticsearch
>>>>>
>>>>> # Heap Size (defaults to 256m min, 1g max)
>>>>> ES_HEAP_SIZE=512m
>>>>>
>>>>> # Heap new generation
>>>>> #ES_HEAP_NEWSIZE=
>>>>>
>>>>> # max direct memory
>>>>> #ES_DIRECT_SIZE=
>>>>>
>>>>> # Maximum number of open files, defaults to 65535.
>>>>> MAX_OPEN_FILES=65535
>>>>>
>>>>> # Maximum locked memory size. Set to "unlimited" if you use the
>>>>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>>>>> # ES_HEAP_SIZE.
>>>>> MAX_LOCKED_MEMORY=unlimited
>>>>>
>>>>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>>>>> #MAX_MAP_COUNT=262144
>>>>>
>>>>> # Elasticsearch log directory
>>>>> #LOG_DIR=/var/log/elasticsearch
>>>>>
>>>>> # Elasticsearch data directory
>>>>> #DATA_DIR=/var/lib/elasticsearch
>>>>>
>>>>> # Elasticsearch work directory
>>>>> #WORK_DIR=/tmp/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration directory
>>>>> #CONF_DIR=/etc/elasticsearch
>>>>>
>>>>> # Elasticsearch configuration file (elasticsearch.yml)
>>>>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>>>
>>>>> # Additional Java OPTS
>>>>> #ES_JAVA_OPTS=
>>>>>
>>>>> # Configure restart on package upgrade (true, every other setting will
>>>>> lead to not restarting)
>>>>> #RESTART_ON_UPGRADE=true
>>>>>
>>>>> I also see the same setting in /etc/init.d/elasticsearch. Do you know
>>>>> which file takes priority? And what a good size would be?
>>>>>
>>>>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>>>>
>>>>>> Hello Joshua ,
>>>>>>
>>>>>> I am not sure which variable you are referring to in the memory
>>>>>> settings in the config file; please paste the comment and the config.
>>>>>> I usually change the config from the init.d script.
>>>>>>
>>>>>> The best approach would be to bulk index, say, 10,000 feeds in sync
>>>>>> mode, wait until everything is indexed, and then proceed to the next
>>>>>> batch.
>>>>>> I am not sure about the Java API, but a while back I used to curl this
>>>>>> stats API to see how many requests were rejected.
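
If I try this from the Java API, I think it would look roughly like the sketch
below — the 10,000 batch size is the example from above, the index/type names
are from my code, and the rest is my own guess at what "sync mode" means here:

    // Sketch: build batches of 10,000 docs and send each bulk request
    // synchronously, waiting for the response before starting the next batch.
    BulkRequestBuilder bulk = client.prepareBulk();
    for (int i = 0; i < pgiTable.size(); i++) {
        String json = mapper.writeValueAsString(pgiTable.get(i));
        bulk.add(client.prepareIndex("rcapropertydata", "rcaproperty").setSource(json));
        boolean lastRow = (i == pgiTable.size() - 1);
        if (bulk.numberOfActions() >= 10000 || lastRow) {
            BulkResponse response = bulk.execute().actionGet(); // blocks until ES acks the batch
            if (response.hasFailures()) {
                System.out.println(response.buildFailureMessage());
            }
            bulk = client.prepareBulk(); // start a fresh batch
        }
    }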
>>>>>>
>>>>>> Thanks
>>>>>> Vineeth
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P <[email protected]> wrote:
>>>>>>
>>>>>>> You also said you wouldn't recommend indexing that much information
>>>>>>> at once. How would you suggest breaking it up, and what status should
>>>>>>> I look for before doing another batch? I have to come up with a
>>>>>>> process that is repeatable and mostly automated.
>>>>>>>
>>>>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>>>>
>>>>>>>> Thanks for the reply, Vineeth!
>>>>>>>>
>>>>>>>> What's a practical heap size? I've seen some people say they set it
>>>>>>>> to 30gb, but this confuses me because the comment in the
>>>>>>>> /etc/default/elasticsearch file suggests the max is only 1gb.
>>>>>>>>
>>>>>>>> I'll look into the threadpool issue. Is there a Java API for
>>>>>>>> monitoring cluster node health? Can you point me to an example or
>>>>>>>> give me a link?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello Joshua,
>>>>>>>>>
>>>>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>>>>> There is a limit on the number of feeds that can be queued for
>>>>>>>>> indexing.
>>>>>>>>>
>>>>>>>>> Try increasing the size of the threadpool queues for index and bulk
>>>>>>>>> to a large number.
>>>>>>>>> Also, through the cluster nodes API for the threadpool, you can see
>>>>>>>>> whether any requests have failed.
>>>>>>>>> Monitor this API for requests that fail due to large volume.
>>>>>>>>>
>>>>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
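
If I understand this right, the change would go in config/elasticsearch.yml,
something like the lines below (the queue sizes are placeholder values I
picked, not numbers from this thread):

    # let more index/bulk requests queue before ES starts rejecting them
    threadpool.index.queue_size: 1000
    threadpool.bulk.queue_size: 1000

and the rejections should show up in the nodes stats:

    curl 'http://192.168.133.131:9200/_nodes/stats/thread_pool?pretty'

where a non-zero "rejected" count under "index" or "bulk" would mean requests
are being dropped.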
>>>>>>>>>
>>>>>>>>> Having said that, I wouldn't recommend bulk indexing that much
>>>>>>>>> information at a time, and 512 MB is not going to help much.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Vineeth
>>>>>>>>>
>>>>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there!
>>>>>>>>>>
>>>>>>>>>> I'm trying to do a one-time index of about 800,000 records into an
>>>>>>>>>> instance of Elasticsearch, but I'm having a bit of trouble. It
>>>>>>>>>> continually fails around 200,000 records. Looking at it in the
>>>>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes
>>>>>>>>>> unrecoverable.
>>>>>>>>>>
>>>>>>>>>> For now, I have it running on a VM on my personal machine.
>>>>>>>>>>
>>>>>>>>>> VM Config:
>>>>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>>>>> 8 GB RAM
>>>>>>>>>> 2 Processors
>>>>>>>>>> 32 GB SSD
>>>>>>>>>>
>>>>>>>>>> Java
>>>>>>>>>> java version "1.7.0_65"
>>>>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1)
>>>>>>>>>> (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>>>>
>>>>>>>>>> Elasticsearch is using mostly the defaults. This is the output
>>>>>>>>>> of:
>>>>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>>>>> {
>>>>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>>>>   "nodes" : {
>>>>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>>>>       "host" : "ubuntu-es",
>>>>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>>>>       "version" : "1.3.2",
>>>>>>>>>>       "build" : "dee175d",
>>>>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>>>>       "process" : {
>>>>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>>>>         "id" : 1092,
>>>>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>>>>         "mlockall" : true
>>>>>>>>>>       }
>>>>>>>>>>     }
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> I adjusted ES_HEAP_SIZE to 512mb.
>>>>>>>>>>
>>>>>>>>>> I'm using the following code to pull data from SQL Server and
>>>>>>>>>> index it.
>>>>>>>>>>