Frederic Konrad <fred.kon...@greensocs.com> writes:

> On 10/04/2015 18:03, Frederic Konrad wrote:
>> On 30/03/2015 23:46, Peter Maydell wrote:
>>> On 30 March 2015 at 07:52, Mark Burton <mark.bur...@greensocs.com>
>>> wrote:
>>>> So - Fred is unwilling to send the patch set as it stands, because
>>>> frankly this part is totally broken.
>>>>
>>>> There is an independent patch set that needs splitting out which
>>>> deals with just the atomic instruction issue - specifically for ARM
>>>> (though I guess it's applicable across the board)...
>>>>
>>>> So - in short - I HOPE to get the patch set onto the reflector
>>>> sometime next week, and I'm sorry for the delay.
>>>
>>> What I really want to see is not so much the patch set but the
>>> design sketch I asked for that lists the various data structures
>>> and indicates which ones are going to be per-cpu, which ones will
>>> be shared (and with what locking), etc.
>>>
>>> -- PMM
>
> Does that make sense?
>
> BTW here is the repository:
> git clone g...@git.greensocs.com:fkonrad/mttcg.git -b multi_tcg_v4

Is there a non-authenticated read-only http or git:// access to this repo?

> Thanks,
> Fred
>
>> Hi everybody,
>> Hi Peter,
>>
>> I tried to recap what we did, how it "works", and what the status is:
>>
>> All the mechanisms are basically unchanged.
>>
>> A lot of TCG structures are not thread-safe, yet all TCG threads can
>> run at the same time and sometimes want to generate code at the same
>> time.
>>
>> Translation-block-related structure:
>>
>> struct TBContext {
>>
>>     TranslationBlock *tbs;
>>     TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
>>     int nb_tbs;
>>     /* any access to the tbs or the page table must use this lock */
>>     QemuMutex tb_lock;
>>
>>     /* statistics */
>>     int tb_flush_count;
>>     int tb_phys_invalidate_count;
>>
>>     int tb_invalidated_flag;
>> };
>>
>> This structure is used in TCGContext: TBContext tb_ctx;
>>
>> "tbs" is basically where the translated blocks are stored, and
>> tb_phys_hash is a hash table to find them quickly.
>>
>> There are two solutions to prevent threading issues:
>> A/ Have a separate tb_ctx for each CPU.
>> B/ Share it between CPUs and protect the tb_ctx accesses.
>>
>> We took the second solution so all CPUs can benefit from the
>> translated TBs. TBContext is written to almost everywhere in
>> translate-all.c. When there are too many TBs, a tb_flush occurs and
>> destroys the array; we don't handle this case right now. tb_lock is
>> already used by user-mode code, so we just converted it to a
>> QemuMutex so we can reuse it in system mode.
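>>
>> To make the locking discipline concrete, here is a rough sketch of
>> the lookup-or-generate path a CPU thread would take. This is
>> illustrative only: mttcg_tb_lookup and mttcg_tb_generate are names
>> invented for this example, not the real functions in cpu-exec.c and
>> translate-all.c; tcg_ctx, QemuMutex and the lock calls are real.
>>
>> static TranslationBlock *mttcg_get_tb(CPUState *cpu, target_ulong pc)
>> {
>>     TBContext *tb_ctx = &tcg_ctx.tb_ctx;
>>     TranslationBlock *tb;
>>
>>     /* Serialize both the tb_phys_hash walk and code generation on
>>        the shared TBContext. */
>>     qemu_mutex_lock(&tb_ctx->tb_lock);
>>     tb = mttcg_tb_lookup(tb_ctx, pc);      /* walk tb_phys_hash */
>>     if (tb == NULL) {
>>         /* Only one CPU generates code at a time, so the same TB is
>>            never translated twice into the shared pool. */
>>         tb = mttcg_tb_generate(cpu, pc);
>>     }
>>     qemu_mutex_unlock(&tb_ctx->tb_lock);
>>     return tb;
>> }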
>>
>> struct TCGContext {
>>     uint8_t *pool_cur, *pool_end;
>>     TCGPool *pool_first, *pool_current, *pool_first_large;
>>     TCGLabel *labels;
>>     int nb_labels;
>>     int nb_globals;
>>     int nb_temps;
>>
>>     /* goto_tb support */
>>     tcg_insn_unit *code_buf;
>>     uintptr_t *tb_next;
>>     uint16_t *tb_next_offset;
>>     uint16_t *tb_jmp_offset; /* != NULL if USE_DIRECT_JUMP */
>>
>>     /* liveness analysis */
>>     uint16_t *op_dead_args; /* for each operation, each bit tells if
>>                                the corresponding argument is dead */
>>     uint8_t *op_sync_args;  /* for each operation, each bit tells if
>>                                the corresponding output argument needs
>>                                to be synced to memory. */
>>
>>     /* tells in which temporary a given register is. It does not take
>>        into account fixed registers */
>>     int reg_to_temp[TCG_TARGET_NB_REGS];
>>     TCGRegSet reserved_regs;
>>     intptr_t current_frame_offset;
>>     intptr_t frame_start;
>>     intptr_t frame_end;
>>     int frame_reg;
>>
>>     tcg_insn_unit *code_ptr;
>>     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
>>     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>>
>>     GHashTable *helpers;
>>
>> #ifdef CONFIG_PROFILER
>>     /* profiling info */
>>     int64_t tb_count1;
>>     int64_t tb_count;
>>     int64_t op_count; /* total insn count */
>>     int op_count_max; /* max insn per TB */
>>     int64_t temp_count;
>>     int temp_count_max;
>>     int64_t del_op_count;
>>     int64_t code_in_len;
>>     int64_t code_out_len;
>>     int64_t interm_time;
>>     int64_t code_time;
>>     int64_t la_time;
>>     int64_t opt_time;
>>     int64_t restore_count;
>>     int64_t restore_time;
>> #endif
>>
>> #ifdef CONFIG_DEBUG_TCG
>>     int temps_in_use;
>>     int goto_tb_issue_mask;
>> #endif
>>
>>     uint16_t gen_opc_buf[OPC_BUF_SIZE];
>>     TCGArg gen_opparam_buf[OPPARAM_BUF_SIZE];
>>
>>     uint16_t *gen_opc_ptr;
>>     TCGArg *gen_opparam_ptr;
>>     target_ulong gen_opc_pc[OPC_BUF_SIZE];
>>     uint16_t gen_opc_icount[OPC_BUF_SIZE];
>>     uint8_t gen_opc_instr_start[OPC_BUF_SIZE];
>>
>>     /* Code generation.  Note that we specifically do not use
>>        tcg_insn_unit here, because there's too much arithmetic
>>        throughout that relies on addition and subtraction working
>>        on bytes.  Rely on the GCC extension that allows arithmetic
>>        on void*. */
>>     int code_gen_max_blocks;
>>     void *code_gen_prologue;
>>     void *code_gen_buffer;
>>     size_t code_gen_buffer_size;
>>     /* threshold to flush the translated code buffer */
>>     size_t code_gen_buffer_max_size;
>>     void *code_gen_ptr;
>>
>>     TBContext tb_ctx;
>>
>>     /* The TCGBackendData structure is private to tcg-target.c. */
>>     struct TCGBackendData *be;
>> };
>>
>> This structure is used to translate the TBs. The easiest solution was
>> to protect code generation so that only one CPU can generate code at
>> a time. This is fine since we don't want duplicate TBs in the pool
>> anyway, and it is achieved with the tb_lock used above.
>>
>> TLB:
>>
>> The TLB is per-CPU, so it is not really a problem as in our
>> implementation one CPU = one pthread. But sometimes a CPU wants to
>> flush the TLB, through an instruction for example, while it is very
>> likely that another CPU in another thread is executing code at the
>> same time. That's why we chose to create a tlb_flush mechanism: when
>> a CPU wants to flush, it asks all CPUs to exit TCG, waits for them,
>> and then exits itself. This can be reused for tb_invalidate and/or
>> tb_flush as well, as sketched below.
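>>
>> A rough sketch of that rendezvous (illustrative only: the flag
>> pending_exclusive_work and the helpers wait_for_all_cpus_halted and
>> resume_all_cpus are invented for this example; cpu_exit() and
>> CPU_FOREACH are the real QEMU primitives):
>>
>> static volatile bool pending_exclusive_work; /* invented for sketch */
>>
>> static void mttcg_run_exclusive(void (*func)(void *), void *data)
>> {
>>     CPUState *cpu;
>>
>>     /* Ask every vCPU thread to leave the TCG execution loop; the
>>        flag would be checked in the cpu_exec() loop. */
>>     pending_exclusive_work = true;
>>     CPU_FOREACH(cpu) {
>>         cpu_exit(cpu);
>>     }
>>     wait_for_all_cpus_halted();
>>
>>     /* No CPU is executing generated code now, so it is safe to do
>>        the tlb_flush (or, later, tb_flush/tb_invalidate) here. */
>>     func(data);
>>
>>     pending_exclusive_work = false;
>>     resume_all_cpus();
>> }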
>>
>> Atomic instructions:
>>
>> Atomic instructions are quite hard to implement. The
>> TranslationBlock implementing the atomic instruction can't be
>> interrupted during execution (e.g. by an interrupt or a signal); a
>> cmpxchg64 helper is used for that.
>>
>> QEMU's global lock:
>>
>> The TCG thread takes the lock during code execution. This is not OK
>> for multi-threading because it means only one thread will be running
>> at a time. That's why we took Jan's patch to allow TCG to run
>> without the lock and take it only when needed.
>>
>> What is the status:
>>
>> * We can start a vexpress-a15 simulation with two A15s and run two
>>   Dhrystones at a time; performance is improved and it's quite
>>   stable.
>>
>> What is missing:
>>
>> * tb_flush is not implemented correctly.
>> * The PageDesc structure is not protected: the patch which
>>   introduced a first_tb array was not the right approach and has
>>   been removed. This implies that tb_invalidate is broken.
>>
>> For both issues we plan to use the same mechanism as tlb_flush:
>> exit all the CPUs, flush or invalidate, and let them continue. A
>> generic mechanism must be implemented for that.
>>
>> Known issues:
>>
>> * The GDB stub is broken because it uses tb_invalidate, which we
>>   haven't implemented yet, and there are probably other issues.
>> * SMP > 2 crashes, probably because of tb_invalidate as well.
>> * We don't know the status of the user-mode code, which is probably
>>   broken by our changes.

-- Alex Bennée